Supplementary Notebook S12: Protein-Level Downstream Analysis Comparison


Requirements:

This notebook requires 02-ApplyingProteoForge_48hr.ipynb and 02-ApplyingProteoForge_72hr.ipynb to be run first to generate the result files used here. Ensure the ./data/cleaned/ and ./data/results/ directories contain the necessary input files.


Data Information:

  • Cell Line: H358 (NSCLC - Non-Small Cell Lung Cancer)
  • Conditions: True Hypoxia (1% O₂) vs Normoxia (21% O₂)
  • Timepoints: 48 hours and 72 hours
  • Quantification Methods: 4 approaches (Original, mean(top), mean(all), dPFs)
    • Original is from the DIANN protein-level output; the other three are calculated from peptide-level data.

Purpose:

This notebook compares protein-level quantification results from four methods: the original protein-level DIANN output, the mean of the top-3 peptides per protein, the mean of all peptides, and the proteoform-informed quantification using dPFs. The goal is to assess how proteoform-aware quantification influences the metrics we care about in protein-level data, such as technical reproducibility, differential expression detection, and biological relevance (e.g. enrichment of hypoxia-related pathways).
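As a concrete illustration of how the two peptide-averaging strategies differ, a toy peptide table can be aggregated both ways (the values and column names below are invented for this sketch; the actual computation appears in Section 02.2):

```python
import pandas as pd

# Toy peptide-level intensities for one protein (illustrative values only)
peptides = pd.DataFrame({
    "Protein": ["P1"] * 5,
    "s1": [100.0, 80.0, 60.0, 10.0, 5.0],
    "s2": [110.0, 85.0, 55.0, 12.0, 4.0],
})
sample_cols = ["s1", "s2"]

# mean(all): average every peptide of the protein
mean_all = peptides.groupby("Protein")[sample_cols].mean()

# mean(top): rank peptides by overall abundance, keep the 3 highest, average those
order = peptides[sample_cols].mean(axis=1).sort_values(ascending=False).index
mean_top = (peptides.loc[order]
            .groupby("Protein").head(3)
            .groupby("Protein")[sample_cols].mean())

print(mean_all)  # s1 = 51.0 (low-abundance peptides pull the average down)
print(mean_top)  # s1 = 80.0 (driven by the three strongest peptides)
```

dPFs differ from both averages in that peptides are first grouped in a proteoform-aware way before aggregation, which neither simple average can mimic.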


01. Setup

This section imports required libraries, configures display settings, and defines paths for data and figures.

Note: The HTML rendering of this notebook hides code cells by default. Click the "Code" buttons on the right to expand them.

01.1 Libraries

# Libraries Used
import os
import sys
import warnings

import numpy as np # Numerical computing
import pandas as pd # Data manipulation

import seaborn as sns # R-like high-level plots
import matplotlib.pyplot as plt # Python's base plotting 

from sklearn.decomposition import PCA

# Add the project root to the import path
sys.path.append('../')
from src import utils, plots 
from src import tests, enrich

from ProteoForge import parse, process
from ProteoForge import impute, normalize

warnings.filterwarnings('ignore')
# Initialize the timer
startTime = utils.getTime()

01.2 Configure Notebook Settings

Configure visualization styles, color palettes, and display options for consistent output formatting throughout the analysis.

## Configure Plotting Settings
def_colors = [
    "#139593", "#fca311", "#e54f2a",
    "#c3c3c3", "#555555",
    "#690000", "#5f4a00", "#004549"
]

condition_colors = {
    'True Hypoxia (1% O2) - 48 Hr': '#e9c46a',
    'True Hypoxia (1% O2) - 72 Hr': '#f4a261',
    'Normoxia (21% O2) - 48 Hr': '#ade8f4',
    'Normoxia (21% O2) - 72 Hr': '#0077b6',
    # Won't be used since the manuscript didn't cover these conditions
    # 'Chemical Hypoxia (CoCl2) - 48 Hr': '#c77dff',
    # 'Chemical Hypoxia (CoCl2) - 72 Hr': '#8338ec',
    # 'Oxidative Stress (H2O2) - 48 Hr': '#ff006e',
    # 'Oxidative Stress (H2O2) - 72 Hr': '#fb5607',
}

method_order = ['Original', 'mean(top)', 'mean(all)', 'dPFs']

proteinMethods = {
    "Original": "#f4f1de",
    "mean(top)": "#f2cc8f",
    "mean(all)": "#81b29a",
    "dPFs": "#183a37"
}

cv_group_palettes = {
    "<10%":  "#081c15",
    "<25%":  "#1b4332",
    "<50%": "#2d6a4f",
    "<100%": "#52b788",
    ">100%": "#95d5b2",
}

# --- Define Order and Colors ---
status_colors = {
    'Unexplained': "#C2C0C0",
    'Excluded': '#565d61',
    'Upregulated': '#780000',
    'Downregulated': '#e36414',
    'Equivalent': '#003049',
}

count_type_order = list(status_colors.keys())

status_markers = {
    "Equivalent": "o",
    "Upregulated": "^",
    "Downregulated": "v",
    "Unexplained": "s",
    "Excluded": "x"
}

cond_colors = {
    'Hypoxia (1% O2)': '#f4a261',
    'Normoxia (21% O2)': '#0077b6',
}
dura_colors = {
    '48hr': '#c3c3c3',
    '72hr': '#555555',
}
cond_order = ['Normoxia (21% O2)', 'Hypoxia (1% O2)']
dura_order = ['48hr', '72hr']

# Set seaborn style
sns.set_theme(
    style="white",
    context="paper",
    palette=def_colors,
    font_scale=1,
    rc={
        "figure.figsize": (6, 4),
        "font.family": "sans-serif",
        "font.sans-serif": ["Arial", "Ubuntu Mono"],
    }
)

# Figure Saving Settings
figure_formats = ["png", "pdf"]
save_to_folder = True
transparent_bg = True
figure_dpi = 300

## Configure dataframe displaying
pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500)  # Set a wider display width


## Printing Settings
verbose = True

plots.color_palette( def_colors, save=False )
plots.color_palette( condition_colors, save=False, name='condition', size=2)
plots.color_palette( proteinMethods, save=False, name='protein level calculation method', size=2)

01.3 Data and Result Paths

  • home_path — Base project path (e.g., ./)
  • data_name — Dataset identifier used throughout ("hypoxia")
  • notebook_name — This notebook name for logging and organizing outputs ("04-Downstream")
  • data_path — Root data folder ({home_path}data/)
  • raw_path — Raw input folder ({data_path}input/{data_name}/)
  • input_path — Cleaned input data from preprocessing ({data_path}cleaned/{data_name}/)
  • output_path — ProteoForge result files ({data_path}results/{data_name}/)
  • figure_path — Destination for generated figures ({home_path}figures/{data_name}/{notebook_name}/)
home_path = './'
data_name = "hypoxia"
notebook_name = "04-Downstream"
data_path = f"{home_path}data/"
raw_path = f"{data_path}input/{data_name}/"
input_path = f"{data_path}cleaned/{data_name}/"
output_path = f"{data_path}results/{data_name}/"
figure_path = f"{home_path}figures/{data_name}/{notebook_name}/"

# Create the output folder if it does not exist
if not os.path.exists(output_path):
    os.makedirs(output_path)

# Create figure folder structure, if needed
if save_to_folder:
    for i in figure_formats:
        cur_folder = figure_path + i + "/"
        if not os.path.exists(cur_folder):
            os.makedirs(cur_folder)

01.4 Load Input Data

Load the imputed peptide-level data, protein-level quantification results from different methods, and the FASTA database for annotation mapping.

Imputed Peptide Data: Contains peptide intensities across samples after imputation. This will be the main data for three of the quantification methods.

peptide_df = pd.read_feather(f"{input_path}imputed_data.feather").reset_index()
proteins_used = peptide_df['Protein'].unique()

Fasta Table: Contains protein sequences and annotations for mapping peptides to proteins and functional information. It is loaded again here for the protein-level analyses.

fasta_data = parse.fasta_to_df(
    fasta_path = f'./data/input/20230316_HumanCr_20806.fasta',
    gene_only = False,
    column_order=['entry', 'geneName', 'proteinDescription', 'sequenceLength', 'molecularWeight_kDa', 'sequence']
).rename(
    columns={
        'entry': 'Protein',
        'geneName': 'Gene',
        'proteinDescription': 'Description',
        'sequenceLength': 'Length',
        'molecularWeight_kDa': 'Weight(kDa)',
        'sequence': 'Sequence'
    }
)
fasta_data.head()
Processing FASTA file: ./data/input/20230316_HumanCr_20806.fasta
Auto-selected: multiprocessing with 4 processes
Loading FASTA entries...
Processing 26855 entries in chunks of 2828
Processed 26772 entries, skipped 83
  - invalid_sequence: 78
  - length_filter: 5
Generated DataFrame with 26772 rows and 6 columns
Protein Gene Description Length Weight(kDa) Sequence
0 A0A024B7W1 None Genome polyprotein 3423 378.8692 MKNPKKKSGGFRIVNMLKRGVARVSPFGGLKRLPAGLLLGHGPIRM...
1 A0A024R1R8 TMA7B Translation machinery-associated protein 7B 64 7.0870 MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAK...
2 A0A024RBG1 NUDT4B Diphosphoinositol polyphosphate phosphohydrola... 181 20.4213 MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQW...
3 A0A075B6H7 IGKV3-7 Probable non-functional immunoglobulin kappa v... 116 12.7754 MEAPAQLLFLLLLWLPDTTREIVMTQSPPTLSLSPGERVTLSCRAS...
4 A0A075B6H8 IGKV1D-42 Probable non-functional immunoglobulin kappa v... 117 13.0056 MDMRVPAQLLGLLLLWLPGVRFDIQMTQSPSFLSASVGDRVSIICW...

Original Protein Quant Data: DIANN output with protein-level quantification.

original_protein = pd.read_csv(f"{raw_path}report.pg_matrix.tsv", sep="\t", engine='pyarrow')
meta = original_protein.columns[4:].to_list()
# Build a meta dataframe
meta = pd.DataFrame(
    data=meta,
    columns=['filename']
)
# --- Clean and Standardize Metadata ---
print("\n🧹 Cleaning and standardizing metadata columns...")

meta['Run'] = meta['filename'].str.split(r'\\').str[-1].str.replace('.d', '')
# Extract condition identifier
def get_condition(x):
    if '1%' in x:
        return 'True Hypoxia (1% O2)'
    elif 'CoCl' in x:
        return 'Chemical Hypoxia (CoCl2)'
    elif 'H2O2' in x:
        return 'Oxidative Stress (H2O2)'
    else:
        return 'Normoxia (21% O2)'
meta['Condition'] = meta['Run'].map(get_condition)
# Extract the replicate number (special rule for 1% O2)
meta['Replicate'] = meta['Run'].str.split('_').apply(lambda x: x[2] if '1%' in x[1] else x[1][-1])
meta['Duration'] = meta['Run'].str.split('_').str[-5] # 5th element from the end
meta['Colname'] = meta['Run'].str.split('_').map(
    # Simplify the colname based on the condition
    lambda x: f"{x[1]}_{x[3]}_{x[2]}" if '1%' in x[1] else f"{x[1][:-1]}_{x[2]}_{x[1][-1]}"
)
# --- Finalize DataFrame ---
print("Selecting final columns and renaming for clarity...")
print("\n✅ Metadata processing complete.")
print(f"Final DataFrame shape: {meta.shape}")
print(f"Final columns: {meta.columns.tolist()}")
print()

# Sort the metadata DataFrame for better readability
meta = meta.sort_values(
    ['Condition', 'Duration', 'Replicate'], 
    ascending=[False, True, True]
).reset_index(drop=True)

# Keep only the relevant conditions for the study
relevant_conditions = [
    'True Hypoxia (1% O2)',
    'Normoxia (21% O2)',
]
print(f"Filtering metadata to keep only relevant conditions: {relevant_conditions}")
meta = meta[meta['Condition'].isin(relevant_conditions)].reset_index(drop=True)
print(f"Filtered DataFrame shape: {meta.shape}")
meta
🧹 Cleaning and standardizing metadata columns...
Selecting final columns and renaming for clarity...

✅ Metadata processing complete.
Final DataFrame shape: (32, 6)
Final columns: ['filename', 'Run', 'Condition', 'Replicate', 'Duration', 'Colname']

Filtering metadata to keep only relevant conditions: ['True Hypoxia (1% O2)', 'Normoxia (21% O2)']
Filtered DataFrame shape: (16, 6)
filename Run Condition Replicate Duration Colname
0 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_1_48_... H358_1%_1_48_DIA_BB4_1_5432 True Hypoxia (1% O2) 1 48 1%_48_1
1 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_2_48_... H358_1%_2_48_DIA_BC4_1_5440 True Hypoxia (1% O2) 2 48 1%_48_2
2 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_3_48_... H358_1%_3_48_DIA_BD4_1_5448 True Hypoxia (1% O2) 3 48 1%_48_3
3 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_4_48_... H358_1%_4_48_DIA_BE4_1_5456 True Hypoxia (1% O2) 4 48 1%_48_4
4 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_1_72_... H358_1%_1_72_DIA_BB8_1_5436 True Hypoxia (1% O2) 1 72 1%_72_1
5 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_2_72_... H358_1%_2_72_DIA_BC8_1_5444 True Hypoxia (1% O2) 2 72 1%_72_2
6 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_3_72_... H358_1%_3_72_DIA_BD8_1_5452 True Hypoxia (1% O2) 3 72 1%_72_3
7 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_4_72_... H358_1%_4_72_DIA_BE8_1_5460 True Hypoxia (1% O2) 4 72 1%_72_4
8 D:\Data\Tam\RevisionLungDIA\H358\H358_C1_48_DI... H358_C1_48_DIA_BB1_1_5429 Normoxia (21% O2) 1 48 C_48_1
9 D:\Data\Tam\RevisionLungDIA\H358\H358_C2_48_DI... H358_C2_48_DIA_BC1_1_5437 Normoxia (21% O2) 2 48 C_48_2
10 D:\Data\Tam\RevisionLungDIA\H358\H358_C3_48_DI... H358_C3_48_DIA_BD1_1_5445 Normoxia (21% O2) 3 48 C_48_3
11 D:\Data\Tam\RevisionLungDIA\H358\H358_C4_48_DI... H358_C4_48_DIA_BE1_1_5453 Normoxia (21% O2) 4 48 C_48_4
12 D:\Data\Tam\RevisionLungDIA\H358\H358_C1_72_DI... H358_C1_72_DIA_BB5_1_5433 Normoxia (21% O2) 1 72 C_72_1
13 D:\Data\Tam\RevisionLungDIA\H358\H358_C2_72_DI... H358_C2_72_DIA_BC5_1_5441 Normoxia (21% O2) 2 72 C_72_2
14 D:\Data\Tam\RevisionLungDIA\H358\H358_C3_72_DI... H358_C3_72_DIA_BD5_1_5449 Normoxia (21% O2) 3 72 C_72_3
15 D:\Data\Tam\RevisionLungDIA\H358\H358_C4_72_DI... H358_C4_72_DIA_BE5_1_5457 Normoxia (21% O2) 4 72 C_72_4
cv_datas = []
# Build a dictionary to match condition_colors keys: 'Condition - Duration Hr'
condition_to_samples = {}
for condition in meta['Condition'].unique():
    for duration in sorted(meta.loc[meta['Condition'] == condition, 'Duration'].unique()):
        key = f"{condition} - {duration} Hr"
        samples = meta.loc[(meta['Condition'] == condition) & (meta['Duration'] == duration), 'Colname'].tolist()
        condition_to_samples[key] = samples

# samples_to_condition mapping
samples_to_condition = {sample: condition for condition, samples in condition_to_samples.items() for sample in samples}
# samples_to_colors mapping
samples_to_colors = {sample: condition_colors[condition] for sample, condition in samples_to_condition.items()}

02. Protein Quantification Methods Comparison

This section loads and processes protein-level intensities using four different quantification strategies. For each method, we calculate coefficient of variation (CV) to assess reproducibility and perform PCA to visualize sample separation.

02.1 Original DIANN Protein Intensities

Load protein intensities as reported by DIA-NN software. This represents the default protein-level quantification without additional processing. The key representing this method will be Original.

We redo some of the data cleanup steps from the first notebook on this protein-level version to keep it consistent with the peptide-level processing.

Loading the Data

original_protein = pd.read_csv(f"{raw_path}report.pg_matrix.tsv", sep="\t", engine='pyarrow')
original_protein = original_protein.drop(
    columns=['Protein.Names', 'Genes', 'First.Protein.Description']
).rename(
    columns={"Protein.Group": "Protein"}
)
# Display summary information
print(f"Data shape (wide format): {original_protein.shape}")


print("\n--- 🧬 Protein ID Representation ---")
# Check if the protein contains '-' indicating it has isoforms
has_isoform = original_protein['Protein'].str.contains('-').any()
has_multiples = original_protein['Protein'].str.contains(';').any()
# Friendly print statements to explain the protein ID representation
if has_isoform:
    print("The dataset contains protein IDs with isoforms (e.g., 'P12345-2'). These represent specific variants of a protein.")
else:
    print("The dataset does not contain protein IDs with isoforms.")

if has_multiples:
    print("The dataset contains protein IDs with multiple entries (e.g., 'P12345; P67890'). These represent proteins that may have multiple forms or related proteins.")
    print()
    print("Will explode these entries to create a more detailed DataFrame.")
    original_protein = process.explode_aligned(original_protein, ['Protein'], sep=';', verbose=verbose)
    print()
    print(f"After exploding, the dataset now has {len(original_protein)} rows and {len(original_protein.columns)} columns.")
else:
    print("The dataset does not contain protein IDs with multiple entries.")

print(f"Values below 2 should be set as NaN, they might be artifacts of quantification.")
print(f"Number of missing values in the DataFrame before cleanup: {original_protein.isnull().sum().sum()}")
# Identify numeric columns in the DataFrame
numeric_columns = original_protein.select_dtypes(include=[np.number]).columns
original_protein[numeric_columns] = original_protein[numeric_columns].mask(
    original_protein[numeric_columns] < 2
)
print(f"Number of missing values in the DataFrame after cleanup: {original_protein.isnull().sum().sum()}")

original_protein = original_protein.set_index('Protein')[meta['filename']]
print("Set cleaned column names from metadata")
original_protein.columns = meta['Colname']
original_protein = original_protein.sort_index()
original_protein.head()
Data shape (wide format): (8918, 33)

--- 🧬 Protein ID Representation ---
The dataset does not contain protein IDs with isoforms.
The dataset contains protein IDs with multiple entries (e.g., 'P12345; P67890'). These represent proteins that may have multiple forms or related proteins.

Will explode these entries to create a more detailed DataFrame.
🚀 Starting aligned explosion for columns: ['Protein']
Original DataFrame shape: (8918, 33)
Final DataFrame shape: (9029, 33)
✅ Transformation complete in 0.0932 seconds.

After exploding, the dataset now has 9029 rows and 33 columns.
Values below 2 should be set as NaN, they might be artifacts of quantification.
Number of missing values in the DataFrame before cleanup: 12877
Number of missing values in the DataFrame after cleanup: 12877
Set cleaned column names from metadata
Colname 1%_48_1 1%_48_2 1%_48_3 1%_48_4 1%_72_1 1%_72_2 1%_72_3 1%_72_4 C_48_1 C_48_2 C_48_3 C_48_4 C_72_1 C_72_2 C_72_3 C_72_4
Protein
A0A024RBG1 106932.0000 114285.0000 101081.0000 120276.0000 124488.0000 109649.0000 104434.0000 92969.7000 125823.0000 106866.0000 115710.0000 137117.0000 110514.0000 138896.0000 124320.0000 104223.0000
A0A096LP01 9701.4100 6065.4900 8416.2600 10267.9000 NaN 8381.3700 8866.8700 8336.3700 11194.8000 14924.0000 14774.3000 12868.4000 12249.5000 13873.9000 13091.8000 16304.4000
A0A0B4J2D5 239693.0000 223978.0000 224786.0000 254631.0000 174872.0000 262457.0000 267884.0000 242496.0000 241334.0000 177731.0000 210887.0000 269207.0000 233719.0000 223538.0000 204650.0000 169070.0000
A0A0B4J2F0 6286.4600 NaN 20654.8000 22783.6000 22998.9000 22887.6000 26960.6000 19887.2000 NaN 10762.5000 19256.9000 23658.6000 12704.9000 11545.9000 20380.3000 23297.6000
A0A2Z4LIS9 4035.0200 2587.3300 2064.8100 2414.1500 5768.8800 2343.8100 2189.9700 2148.6600 1615.3900 3101.2300 2665.9200 3850.5500 2414.3300 3994.3500 2486.6600 2599.2100
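`process.explode_aligned` is a ProteoForge helper; for readers without the package, the same row explosion of multi-protein groups can be approximated in plain pandas (toy data, invented for this sketch):

```python
import pandas as pd

# Toy protein-group matrix with one multi-protein entry (illustrative)
pg = pd.DataFrame({
    "Protein": ["P12345;P67890", "Q00001"],
    "sampleA": [1000.0, 2000.0],
})

# Split the ID string into a list, then give each member its own row;
# intensity values are duplicated across the exploded rows
exploded = pg.assign(Protein=pg["Protein"].str.split(";")).explode("Protein")
exploded["Protein"] = exploded["Protein"].str.strip()
exploded = exploded.reset_index(drop=True)

print(len(exploded))  # 3 rows: P12345 and P67890 (both 1000.0), plus Q00001
```

Note that duplicating intensities across group members, as here, inflates the row count (8918 → 9029 above) without adding independent measurements.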

02.1.1 Data Cleaning and Normalization

Clean the protein intensity data through:

  • Outlier sample detection using combined metric scoring (correlation, quantification ratio, CV)
  • Removal of systematically aberrant samples
  • Median centering normalization
  • Missing value imputation using a multi-step pipeline

The same cleaning and normalization steps used for the peptide-level data are applied here to ensure consistency across quantification methods.
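`utils.identify_outlier_samples` and `run_metric_combination_analysis` are project-specific. As a rough illustration of the correlation component of the combined score only (the toy data, function name, and scoring details below are assumptions, not the actual implementation):

```python
import numpy as np
import pandas as pd

def correlation_scores(group_df: pd.DataFrame) -> pd.Series:
    """Mean Spearman correlation of each sample to the other replicates
    in its group; low scores suggest outlier behavior."""
    corr = group_df.corr(method="spearman").to_numpy()
    np.fill_diagonal(corr, np.nan)  # ignore self-correlation
    return pd.Series(np.nanmean(corr, axis=0), index=group_df.columns)

# Toy group: three concordant replicates plus one scrambled sample
rng = np.random.default_rng(0)
base = rng.lognormal(10, 1, size=200)
group = pd.DataFrame({
    "rep1": base * rng.normal(1, 0.05, 200),
    "rep2": base * rng.normal(1, 0.05, 200),
    "rep3": base * rng.normal(1, 0.05, 200),
    "odd":  rng.permutation(base),
})
scores = correlation_scores(group)
print(scores.idxmin())  # 'odd' scores lowest
```

In the real pipeline this is combined with a quantification ratio and 1 - CV (per the weights above), and a sample is flagged only after it recurs across enough comparisons.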

metric_weights = {  # Weights for the metrics used in pairwise combined score calculation
    'correlation': 1.0,
    'quantification': 1.0,
    'cv': 1.0
}
minOutlier_occurrence = 3 # minimum number of outlier occurrences for a sample to be flagged as an outlier

conditions_to_check = list(condition_to_samples.keys())  # List of conditions to check for outlier detection

print("\nCalculating Pairwise Combined Scores...")
print(" - Using metric weights:", metric_weights)
print(" - Using Spearman correlation, quantification as ratio, and CV as 1 - CV")

all_outlier_samples, combined_group_reports = utils.identify_outlier_samples(
    data=original_protein,
    sample_groups=condition_to_samples,
    analysis_func=utils.run_metric_combination_analysis,
    min_outlier_occurrence=minOutlier_occurrence,
    verbose=False
)
print(f"\nComplete list of unique outlier samples to remove (after min_outlier_occurrence >= {minOutlier_occurrence})\n - {all_outlier_samples}")
print()

plot_data = original_protein.reset_index().melt(
    id_vars='Protein',
    var_name='Sample',
    value_name='Intensity'
).dropna().assign(log2_Intensity=lambda x: np.log2(x['Intensity']))
fig, ax = plt.subplots(figsize=(12, 4))
sns.boxplot(
    x='Sample', y='log2_Intensity', hue='Sample', data=plot_data, ax=ax, 
    palette=samples_to_colors,
    fliersize=0.5,
    linewidth=0.5,
    notch=False,
    showmeans=True,
    meanprops={"marker": "o", "markerfacecolor": "white", "markeredgecolor": "black"},
    medianprops={"color": "white", "linewidth": 2.5},
    whiskerprops={"color": "black", "linewidth": 1},
    capprops={"color": "black", "linewidth": 0},
    legend=False
)
y_min = 0
y_max = plot_data['log2_Intensity'].max() + 1
ax.set_ylim(y_min, y_max)
ax.tick_params(axis='x', labelrotation=90)
ax.set_title('Boxplot of Log2 Intensities Across Samples', fontsize=16, fontweight='bold')
ax.set_xlabel('Sample', fontsize=14)
ax.set_ylabel('Log2(Intensity)', fontsize=14)
ax.grid(which='both', linestyle='--', linewidth=0.5, color='lightgray', alpha=0.7)

import matplotlib.patches as mpatches

# Highlight background for outlier samples
for i, label in enumerate(ax.get_xticklabels()):
    sample_label = label.get_text()
    if sample_label in all_outlier_samples:
        # Add a rectangle spanning the y-range at the x position
        ax.add_patch(
            mpatches.Rectangle(
                (i - 0.5, y_min),  # x position, y position
                1,                 # width (one boxplot)
                y_max - y_min,     # height
                color='#ffcccc',   # light red
                zorder=0,          # behind the boxplot
                alpha=0.5
            )
        )
# Add legend for outlier highlight
outlier_patch = mpatches.Patch(color='#ffcccc', label='Identified Outlier Sample', alpha=0.5)
ax.legend(handles=[outlier_patch], loc='upper right', fontsize=12)
plt.tight_layout()
plt.show()
# Remove the identified outlier samples from the main data
print("\nRemoving identified outlier samples from the main data...")
print(f"- Samples to remove: {all_outlier_samples}")
print(f"- Data shape before removal: {original_protein.shape}")
cleaned_data = original_protein.drop(columns=all_outlier_samples)
print(f"- Data shape after removal: {cleaned_data.shape}")
print("Outlier removal complete.")
Calculating Pairwise Combined Scores...
 - Using metric weights: {'correlation': 1.0, 'quantification': 1.0, 'cv': 1.0}
 - Using Spearman correlation, quantification as ratio, and CV as 1 - CV

Complete list of unique outlier samples to remove (after min_outlier_occurrence >= 3)
 - ['1%_72_1', 'C_72_4']

Removing identified outlier samples from the main data...
- Samples to remove: ['1%_72_1', 'C_72_4']
- Data shape before removal: (9029, 16)
- Data shape after removal: (9029, 14)
Outlier removal complete.
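For reference, the global median centering applied below by `normalize.by_median_centering` can be sketched as follows. This is a simplification: the real function's rescaling factor logic may differ (its log output reports a power-of-two factor).

```python
import numpy as np
import pandas as pd

def median_center(df: pd.DataFrame, rescale: bool = True) -> pd.DataFrame:
    """Divide each sample (column) by its own median, then optionally
    rescale everything by one global factor so values stay intensity-like.
    A sketch, not ProteoForge's exact implementation."""
    centered = df / df.median(axis=0, skipna=True)
    if rescale:
        centered = centered * np.nanmedian(df.to_numpy())
    return centered

toy = pd.DataFrame({"s1": [100.0, 200.0, 400.0], "s2": [50.0, 100.0, 200.0]})
out = median_center(toy, rescale=False)
print(out.median())  # every column median is now 1.0
```

Centering removes sample-to-sample loading differences; rescaling keeps the values on a familiar intensity scale.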
meta = meta[~meta['Colname'].isin(all_outlier_samples)].reset_index(drop=True)
# Build a dictionary to match condition_colors keys: 'Condition - Duration Hr'
condition_to_samples = {}
for condition in meta['Condition'].unique():
    for duration in sorted(meta.loc[meta['Condition'] == condition, 'Duration'].unique()):
        key = f"{condition} - {duration} Hr"
        samples = meta.loc[(meta['Condition'] == condition) & (meta['Duration'] == duration), 'Colname'].tolist()
        condition_to_samples[key] = samples

# samples_to_condition mapping
samples_to_condition = {sample: condition for condition, samples in condition_to_samples.items() for sample in samples}
# samples_to_colors mapping
samples_to_colors = {sample: condition_colors[condition] for sample, condition in samples_to_condition.items()}
centered_data = normalize.by_median_centering(
    df=cleaned_data,
    rescale_to_original_magnitude=True,
    condition_map=None, # Pass dictionary if want to apply condition-specific centering
    verbose=verbose
)

imputation_pipeline_scheme = [
    { # First is the remove sparsely quantified features to set complete missing for low-value imputation 
        'method': 'amputate',
        'params': { 'min_quantified': 1 }
    },
    # # I generally don't use this step, but it can be useful for some datasets
    # { # Second step is generally to fill the sparse missingness with features' mean or median 
    #     'method': 'fill_dense', 
    #     'params': {'max_missing': 1, 'strategy': 'mean'}
    # },
    # This is the best for complete or mostly missing features
    { # Adds downshifted low-values to completely missing (1.0) or mostly missing (< 1.0) features
        'method': 'downshift',
        'params': {'missingness_threshold': .5, 'shift_magnitude': 2.5, 'low_percentile': 0.05}
    },
    { # Final step is to use k-NN imputation to fill the remaining missing values
        'method': 'knn',
        'params': {'n_neighbors': 4}
    }
]

imputed_data = impute.run_imputation_pipeline(
    data=centered_data,
    cond_dict=condition_to_samples,
    scheme=imputation_pipeline_scheme,
    is_log2=False,  
    return_log2=False,
    verbose=verbose
)
original_protein = imputed_data
Applying Median Centering...
  - Applying global median centering...
    - Rescaling by a global factor of: 32768.00
  - Data after median centering has shape: (9029, 14)
  - Missing values (Before/After): (3319/3319)
====== Starting Imputation Pipeline ======
Initial missing values: 3319

--- Running Step: Sparse Feature Amputation ---
  Parameters: min_quantified=1
  Summary: Created 0 new missing values by amputation.
Total missing values after step 1 ('amputate'): 3319

--- Running Step: Downshifted Imputation ---
  Parameters: missingness_threshold=0.5, shift_magnitude=2.5, low_percentile=0.05
  Condition 'True Hypoxia (1% O2) - 48 Hr': Imputing 257 features with >= 2 missing values.
  Condition 'True Hypoxia (1% O2) - 72 Hr': Imputing 227 features with >= 2 missing values.
  Condition 'Normoxia (21% O2) - 48 Hr': Imputing 243 features with >= 2 missing values.
  Condition 'Normoxia (21% O2) - 72 Hr': Imputing 184 features with >= 2 missing values.
  Summary: Imputed 2464 missing values.
Total missing values after step 2 ('downshift'): 855

--- Running Step: k-NN Imputation ---
  Parameters: n_neighbors=4
  Condition 'True Hypoxia (1% O2) - 48 Hr': Applying k-NN to 211 remaining missing values.
  Condition 'True Hypoxia (1% O2) - 72 Hr': Applying k-NN to 210 remaining missing values.
  Condition 'Normoxia (21% O2) - 48 Hr': Applying k-NN to 179 remaining missing values.
  Condition 'Normoxia (21% O2) - 72 Hr': Applying k-NN to 255 remaining missing values.
  Summary: Imputed 855 missing values.
Total missing values after step 3 ('knn'): 0

====== Imputation Pipeline Finished ======
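The downshift step above is conceptually similar to the widely used down-shifted Gaussian imputation; a minimal sketch (the parameter names echo the scheme above, but the implementation details are assumptions, not ProteoForge's code):

```python
import numpy as np
import pandas as pd

def downshift_impute(df: pd.DataFrame, shift_magnitude: float = 2.5,
                     width: float = 0.3, seed: int = 0) -> pd.DataFrame:
    """Fill NaNs with draws from a Gaussian centered below the observed
    log2 intensity distribution, mimicking values near the detection limit."""
    rng = np.random.default_rng(seed)
    arr = np.log2(df.to_numpy(dtype=float))
    mu, sigma = np.nanmean(arr), np.nanstd(arr)
    mask = np.isnan(arr)
    arr[mask] = rng.normal(mu - shift_magnitude * sigma, width * sigma, mask.sum())
    return pd.DataFrame(2 ** arr, index=df.index, columns=df.columns)

toy = pd.DataFrame({"s1": [1024.0, np.nan, 4096.0], "s2": [2048.0, 512.0, np.nan]})
filled = downshift_impute(toy)
print(filled.isna().sum().sum())  # 0: all missing values received low draws
```

The rationale is that values missing within an otherwise mostly missing condition are likely below the detection limit, so low draws are more faithful than neighbor-based estimates.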

02.1.2 Coefficient of Variation (CV) Analysis

CVs are calculated per condition for this method and appended to the cv_datas list to build comparison plots later on.
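`utils.cv_numpy` and `create_cv_group_plot_data` are project helpers; the underlying computation is the percent coefficient of variation, binned into the groups defined by `cv_group_palettes`. A minimal equivalent (whether the helper uses ddof=1 is an assumption):

```python
import numpy as np
import pandas as pd

def cv_percent(values: np.ndarray, axis: int = 1) -> np.ndarray:
    """Percent CV = 100 * std / mean, ignoring NaNs."""
    return 100 * np.nanstd(values, axis=axis, ddof=1) / np.nanmean(values, axis=axis)

# Toy replicate intensities for three proteins (illustrative)
toy = pd.DataFrame({"r1": [100.0, 50.0, 10.0],
                    "r2": [105.0, 70.0, 40.0],
                    "r3": [95.0, 30.0, 90.0]})
cvs = pd.Series(cv_percent(toy.to_numpy()), index=toy.index)

# Bin into the same groups used for plotting (right-inclusive intervals)
groups = pd.cut(cvs, bins=[0, 10, 25, 50, 100, np.inf],
                labels=["<10%", "<25%", "<50%", "<100%", ">100%"])
print(groups.tolist())  # ['<10%', '<50%', '<100%']
```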

print("\nCalculating Coefficient of Variation (CV) from proteins and group by condition...")

cv_data = pd.DataFrame(index = original_protein.index)
# Calculate the coefficient of variation per biological sample
for k, v in condition_to_samples.items():
    if not all(item in original_protein.columns for item in v):
        continue
    cv_data[k] = utils.cv_numpy(
        original_protein[v].values,
        axis=1,
        ignore_nan=True,
        format="percent"
    )

plot_data = utils.create_cv_group_plot_data(
    cv_data=cv_data,
    cv_group_palettes=cv_group_palettes
).dropna(subset=['CV'])
plot_data['Method'] = 'Original'
cv_datas.append(plot_data)
Calculating Coefficient of Variation (CV) from proteins and group by condition...

02.1.3 PCA Visualization

Perform PCA on the cleaned protein intensities to visualize sample clustering and separation between hypoxia and normoxia conditions.
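`utils.scale_the_data` is a project helper; with `method="zscore"` and `axis=1` it presumably standardizes each protein row before PCA, which can be sketched as follows (an assumption about the helper's behavior):

```python
import numpy as np
import pandas as pd

def zscore_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize each row (protein) to mean 0 and unit variance,
    ignoring NaNs; a guess at scale_the_data(method='zscore', axis=1)."""
    arr = df.to_numpy(dtype=float)
    mu = np.nanmean(arr, axis=1, keepdims=True)
    sd = np.nanstd(arr, axis=1, keepdims=True)
    return pd.DataFrame((arr - mu) / sd, index=df.index, columns=df.columns)

toy = pd.DataFrame({"a": [1.0, 10.0], "b": [2.0, 20.0], "c": [3.0, 30.0]})
z = zscore_rows(toy)
# Every row now has mean ~0 and standard deviation ~1
```

Row-wise scaling puts all proteins on a comparable footing so that the PCA, fit on the transposed matrix with samples as observations, is not dominated by the most abundant proteins.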

# PCA
scaler = "zscore"
complete_data = original_protein
scaled_data = utils.scale_the_data(
    complete_data,
    method=scaler,
    axis=1,
    is_log=False
).fillna(0)

# Run PCA
pca = PCA()
pca_res = pca.fit_transform(scaled_data.T)
pca_exp_var = pca.explained_variance_ratio_

# Set Variables
C1, C2 = 0, 1
hue_group = "Condition"
style_group = "Duration"
comp_locs = [C1, C2]
comp_names = [f'PC{c+1}' for c in comp_locs]
xlabel = f'{comp_names[0]} ({pca_exp_var[comp_locs[0]]:.2%} of variance)'
ylabel = f'{comp_names[1]} ({pca_exp_var[comp_locs[1]]:.2%} of variance)'

# Build the PCA Plot Data
plot_data = pd.DataFrame(pca_res[:, comp_locs], columns=comp_names)
plot_data['UniqueNames'] = complete_data.columns
plot_data['Condition'] = plot_data['UniqueNames'].map(samples_to_condition)
plot_data['Duration'] = plot_data['UniqueNames'].str.split('_').str[1]

# Find Centroids for each group (Condition and Duration)
centroid_data = plot_data.groupby([hue_group, style_group])[comp_names].mean().reset_index()

# Prepare color/marker mapping for unique (hue, style) combinations
unique_groups = centroid_data[[hue_group, style_group]].drop_duplicates()
group_palette = {}
group_marker = {}
markers = ["o", "s", "D", "^", "v", "P", "X", "*"]
for idx, (cond, dura) in enumerate(unique_groups.values):
    color = condition_colors.get(cond, "#888888")
    marker = markers[idx % len(markers)]
    group_palette[(cond, dura)] = color
    group_marker[(cond, dura)] = marker

fig, ax = plt.subplots(figsize=(9, 7))
fig.patch.set_facecolor('white')

# Plot technical replicates
for (cond, dura), group_df in plot_data.groupby([hue_group, style_group]):
    color = group_palette.get((cond, dura), "#888888")
    marker = group_marker.get((cond, dura), "o")
    ax.scatter(
        group_df[comp_names[0]], group_df[comp_names[1]],
        label=f"{cond}, {dura}hr",
        color=color, marker=marker, s=80, edgecolor='k', alpha=0.85, linewidth=1.2, zorder=2
    )
    # Confidence ellipse
    if len(group_df) > 1:
        plots.confidence_ellipse(
            group_df[comp_names[0]], group_df[comp_names[1]], ax, n_std=2,
            edgecolor=color, facecolor=color, alpha=0.18, linewidth=1.5, zorder=1
        )

# Plot centroids
for (cond, dura), group_df in centroid_data.groupby([hue_group, style_group]):
    color = group_palette.get((cond, dura), "#888888")
    marker = group_marker.get((cond, dura), "o")
    ax.scatter(
        group_df[comp_names[0]], group_df[comp_names[1]],
        color=color, marker=marker, s=260, edgecolor='k', linewidth=2.5, alpha=1, zorder=3
    )

# Axis limits and grid
minVal = plot_data[comp_names].min().min()
maxVal = plot_data[comp_names].max().max()
limit = max(abs(minVal), abs(maxVal))
offset = 0.1 * limit
limit += offset
ax.set_xlim(-limit, limit)
ax.set_ylim(-limit, limit)
ax.grid(True, which="both", linestyle="--", linewidth=0.7, alpha=0.4, color="grey")

# 0 lines
ax.axvline(0, color='grey', linestyle='--', linewidth=1.2, alpha=0.6, zorder=0)
ax.axhline(0, color='grey', linestyle='--', linewidth=1.2, alpha=0.6, zorder=0)

# Labels and title
ax.set_xlabel(xlabel, fontsize=14, fontweight="bold")
ax.set_ylabel(ylabel, fontsize=14, fontweight="bold")
ax.set_title(
    f"PCA of Original Protein Intensities (n={complete_data.shape[0]})",
    fontsize=17, fontweight="bold", pad=15, loc="left"
)

# Custom legend: combine color and marker
from matplotlib.lines import Line2D
legend_elements = [
    Line2D(
        [0], [0],
        marker=group_marker[(cond, dura)],
        color='w',
        markerfacecolor=group_palette[(cond, dura)],
        markeredgecolor='k',
        markersize=13,
        linewidth=0,
        label=f"{cond}, {dura}hr"
    )
    for (cond, dura) in unique_groups.values
]
ax.legend(
    handles=legend_elements,
    title="Condition, Duration",
    bbox_to_anchor=(1.02, 1),
    loc="upper left",
    frameon=False,
    fontsize=11,
    title_fontsize=12,
    borderaxespad=0.5
)

sns.despine(ax=ax, left=True, bottom=True)
plt.tight_layout()
plots.finalize_plot(
    fig, show=True, save=save_to_folder, 
    filename='PCA_Original_Proteins',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

The PCA plot shows the hypoxia and normoxia samples clustering distinctly, indicating that the original DIANN protein quantification captures biological differences effectively. However, there is substantial overlap between timepoints, showing that timepoint differences are not well captured, especially for the normoxia samples.

The explained variance of the first two principal components is recorded for later comparison against the other methods.


02.2 Mean of Top-3 Peptides per Protein

Calculate protein intensities by averaging the three most abundant peptides per protein. This approach reduces the influence of low-abundance or poorly quantified peptides.

02.2.1 Calculating Protein Intensities from Top-3 Peptides

peptide_df = pd.read_feather(f"{input_path}imputed_data.feather").reset_index()

# Ensure the first two columns are 'Protein' and 'unique_id', rest are samples
sample_cols = peptide_df.columns[2:]
# Calculate mean intensity for each peptide across all samples
peptide_df['mean_intensity'] = peptide_df[sample_cols].mean(axis=1, skipna=True)
# Sort by Protein and mean_intensity descending
peptide_df_sorted = peptide_df.sort_values(['Protein', 'mean_intensity'], ascending=[True, False])

# Select top 3 peptides per protein (vectorized)
top3_peptides = peptide_df_sorted.groupby('Protein', group_keys=False).head(3)

# Group by Protein and take the mean of sample columns (vectorized)
protein_top3 = top3_peptides.groupby('Protein')[sample_cols].mean()

print(f'Protein-level intensities (Top3 mean) shape: {protein_top3.shape}')

print(f"\n--- 🔄 Median Centering Normalization ---")
protein_top3 = normalize.by_median_centering(
    df=protein_top3,
    rescale_to_original_magnitude=True,
    condition_map=None,
    verbose=verbose
)

cv_data = pd.DataFrame(index = protein_top3.index)
# Calculate the coefficient of variation per biological sample
for k, v in condition_to_samples.items():
    # if not all(item in protein_top3.columns for item in v):
    #     continue
    cv_data[k] = utils.cv_numpy(
        protein_top3[v].values,
        axis=1,
        ignore_nan=True,
        format="percent"
    )

plot_data = utils.create_cv_group_plot_data(
    cv_data=cv_data,
    cv_group_palettes=cv_group_palettes
).dropna(subset=['CV'])
plot_data['Method'] = 'mean(top)'
cv_datas.append(plot_data)

protein_top3.head()
Protein-level intensities (Top3 mean) shape: (7161, 14)

--- 🔄 Median Centering Normalization ---

Applying Median Centering...
  - Applying global median centering...
    - Rescaling by a global factor of: 32768.00
  - Data after median centering has shape: (7161, 14)
  - Missing values (Before/After): (0/0)
Colname 1%_48_1 1%_48_2 1%_48_3 1%_48_4 1%_72_2 1%_72_3 1%_72_4 C_48_1 C_48_2 C_48_3 C_48_4 C_72_1 C_72_2 C_72_3
Protein
A0A024RBG1 42434.0877 44675.9925 39905.7199 45166.4723 41072.9743 41509.6032 37214.0846 47803.0084 42063.5809 40992.1971 48181.0230 43600.5230 53923.3764 48616.6057
A0A0B4J2D5 102861.7619 109717.8950 96666.4961 104414.6932 113987.2439 108450.8661 105768.6036 114020.7478 80482.5059 88493.0615 109336.0082 100739.3204 98188.1377 84013.4979
A0A1B0GTU1 15418.3217 13016.2592 20110.4711 16653.0423 10068.9798 10815.8631 7411.4047 18067.9728 15565.7734 14406.1895 15852.1724 16550.5883 21497.2074 21683.5016
A0A1W2PQL4 138883.8369 157641.7370 92956.1284 144272.3660 181450.5185 186647.1011 164358.8932 148590.3989 165077.2871 170214.6665 166396.9636 141739.9102 158388.9094 174016.5392
A0A2R8Y4L2 2516111.9537 2101575.8857 2084923.5383 1978299.3752 2458649.4794 2321907.3483 1817408.9804 2162122.0254 2094417.0202 2240939.2056 2186368.9643 2425103.1440 1965437.4835 1998541.0596

02.2.2 PCA Visualization

Perform PCA on the cleaned protein intensities to visualize sample clustering and separation between hypoxia and normoxia conditions.

# PCA
scaler = "zscore"
complete_data = protein_top3.fillna(0)
scaled_data = utils.scale_the_data(
    complete_data,
    method=scaler,
    axis=1,
    is_log=False
)

# Run PCA
pca = PCA()
pca_res = pca.fit_transform(scaled_data.T)
pca_exp_var = pca.explained_variance_ratio_

# Set Variables
C1, C2 = 0, 1
hue_group = "Condition"
style_group = "Duration"
comp_locs = [C1, C2]
comp_names = [f'PC{c+1}' for c in comp_locs]
xlabel = f'{comp_names[0]} ({pca_exp_var[comp_locs[0]]:.2%} of variance)'
ylabel = f'{comp_names[1]} ({pca_exp_var[comp_locs[1]]:.2%} of variance)'

# Build the PCA Plot Data
plot_data = pd.DataFrame(pca_res[:, comp_locs], columns=comp_names)
plot_data['UniqueNames'] = complete_data.columns
plot_data['Condition'] = plot_data['UniqueNames'].map(samples_to_condition)
plot_data['Duration'] = plot_data['UniqueNames'].str.split('_').str[1]

# Find Centroids for each group (Condition and Duration)
centroid_data = plot_data.groupby([hue_group, style_group])[comp_names].mean().reset_index()

# Prepare color/marker mapping for unique (hue, style) combinations
unique_groups = centroid_data[[hue_group, style_group]].drop_duplicates()
group_palette = {}
group_marker = {}
markers = ["o", "s", "D", "^", "v", "P", "X", "*"]
for idx, (cond, dura) in enumerate(unique_groups.values):
    color = condition_colors.get(cond, "#888888")
    marker = markers[idx % len(markers)]
    group_palette[(cond, dura)] = color
    group_marker[(cond, dura)] = marker

fig, ax = plt.subplots(figsize=(9, 7))
fig.patch.set_facecolor('white')

# Plot technical replicates
for (cond, dura), group_df in plot_data.groupby([hue_group, style_group]):
    color = group_palette.get((cond, dura), "#888888")
    marker = group_marker.get((cond, dura), "o")
    ax.scatter(
        group_df[comp_names[0]], group_df[comp_names[1]],
        label=f"{cond}, {dura}hr",
        color=color, marker=marker, s=80, edgecolor='k', alpha=0.85, linewidth=1.2, zorder=2
    )
    # Confidence ellipse
    if len(group_df) > 1:
        plots.confidence_ellipse(
            group_df[comp_names[0]], group_df[comp_names[1]], ax, n_std=2,
            edgecolor=color, facecolor=color, alpha=0.18, linewidth=1.5, zorder=1
        )

# Plot centroids
for (cond, dura), group_df in centroid_data.groupby([hue_group, style_group]):
    color = group_palette.get((cond, dura), "#888888")
    marker = group_marker.get((cond, dura), "o")
    ax.scatter(
        group_df[comp_names[0]], group_df[comp_names[1]],
        color=color, marker=marker, s=260, edgecolor='k', linewidth=2.5, alpha=1, zorder=3
    )

# Axis limits and grid
minVal = plot_data[comp_names].min().min()
maxVal = plot_data[comp_names].max().max()
limit = max(abs(minVal), abs(maxVal))
offset = 0.1 * limit
limit += offset
ax.set_xlim(-limit, limit)
ax.set_ylim(-limit, limit)
ax.grid(True, which="both", linestyle="--", linewidth=0.7, alpha=0.4, color="grey")

# 0 lines
ax.axvline(0, color='grey', linestyle='--', linewidth=1.2, alpha=0.6, zorder=0)
ax.axhline(0, color='grey', linestyle='--', linewidth=1.2, alpha=0.6, zorder=0)

# Labels and title
ax.set_xlabel(xlabel, fontsize=14, fontweight="bold")
ax.set_ylabel(ylabel, fontsize=14, fontweight="bold")
ax.set_title(
    f"PCA of Averaging Top Peptides to Protein Intensities (n={complete_data.shape[0]})",
    fontsize=17, fontweight="bold", pad=15, loc="left"
)

# Custom legend: combine color and marker
from matplotlib.lines import Line2D
legend_elements = [
    Line2D(
        [0], [0],
        marker=group_marker[(cond, dura)],
        color='w',
        markerfacecolor=group_palette[(cond, dura)],
        markeredgecolor='k',
        markersize=13,
        linewidth=0,
        label=f"{cond}, {dura}hr"
    )
    for (cond, dura) in unique_groups.values
]
ax.legend(
    handles=legend_elements,
    title="Condition, Duration",
    bbox_to_anchor=(1.02, 1),
    loc="upper left",
    frameon=False,
    fontsize=11,
    title_fontsize=12,
    borderaxespad=0.5
)

sns.despine(ax=ax, left=True, bottom=True)
plt.tight_layout()
plots.finalize_plot(
    fig, show=True, save=save_to_folder, 
    filename='PCA_meanTop_Proteins',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

Similar separation of hypoxia and normoxia samples is observed; however, the separation between normoxia timepoints remains poorly resolved.


02.3 Mean of All Peptides per Protein

Calculate protein intensities by averaging all detected peptides per protein. This approach incorporates all available peptide information but may be influenced by outlier peptides.
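As a toy illustration (made-up numbers, not notebook data) of why the mean over all peptides is sensitive to a single outlier peptide while a median-based summary would not be:

```python
import numpy as np

# Four consistent peptides plus one aberrant high-intensity peptide
peptide_intensities = np.array([100.0, 110.0, 95.0, 105.0, 900.0])

# The mean is pulled far above the consensus level by the outlier
mean_estimate = peptide_intensities.mean()

# The median stays at the consensus level, unaffected by the outlier
median_estimate = np.median(peptide_intensities)
```

This is why the top-3 and proteoform-aware summaries can behave differently from a plain mean when a protein carries divergent peptides.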

02.3.1 Calculating Protein Intensities from All Peptides

peptide_df = pd.read_feather(f"{input_path}imputed_data.feather").reset_index()
# Ensure the first two columns are 'Protein' and 'unique_id', rest are samples
sample_cols = peptide_df.columns[2:]

# Average all peptides per protein
protein_all = peptide_df.groupby('Protein')[sample_cols].mean()

print(f'Protein-level intensities (All peptides mean) shape: {protein_all.shape}')

print(f"\n--- 🔄 Median Centering Normalization ---")
protein_all = normalize.by_median_centering(
    df=protein_all,
    rescale_to_original_magnitude=True,
    condition_map=None,
    verbose=verbose
)

cv_data = pd.DataFrame(index = protein_all.index)
# Calculate the coefficient of variation per biological sample
for k, v in condition_to_samples.items():
    # if not all(item in protein_all.columns for item in v):
    #     continue
    cv_data[k] = utils.cv_numpy(
        protein_all[v].values,
        axis=1,
        ignore_nan=True,
        format="percent"
    )

plot_data = utils.create_cv_group_plot_data(
    cv_data=cv_data,
    cv_group_palettes=cv_group_palettes
).dropna(subset=['CV'])
plot_data['Method'] = 'mean(all)'
cv_datas.append(plot_data)


protein_all.head()
Protein-level intensities (All peptides mean) shape: (7161, 14)

--- 🔄 Median Centering Normalization ---

Applying Median Centering...
  - Applying global median centering...
    - Rescaling by a global factor of: 16384.00
  - Data after median centering has shape: (7161, 14)
  - Missing values (Before/After): (0/0)
Colname 1%_48_1 1%_48_2 1%_48_3 1%_48_4 1%_72_2 1%_72_3 1%_72_4 C_48_1 C_48_2 C_48_3 C_48_4 C_72_1 C_72_2 C_72_3
Protein
A0A024RBG1 27636.4812 30005.0847 27084.4570 30501.0198 26878.6847 27282.2816 24778.7294 31250.4761 28260.9575 27105.6051 31660.7097 28399.4406 36102.3521 33361.4951
A0A0B4J2D5 57134.1133 58310.0884 55598.9816 62191.6960 64318.6873 65554.3341 61014.3493 60071.2849 46073.6487 49443.3042 60765.3547 59501.3649 55451.3679 51310.4708
A0A1B0GTU1 9527.9206 8306.3244 12492.2302 10603.8432 6954.6379 7748.1887 5544.2338 12078.9927 10559.6725 9870.6067 10691.6913 10105.9986 13391.0161 13435.8479
A0A1W2PQL4 87048.5451 101193.6058 63425.9557 93900.9660 111078.8289 116289.6754 102781.4867 93745.7319 106650.8837 107614.7891 104448.3260 90009.2451 104011.9006 113290.7150
A0A2R8Y4L2 1178694.3393 1016651.1508 1031318.9612 955571.0935 1131586.4682 1103637.8918 934521.9371 1030494.1624 1001852.8436 1087337.1202 997756.8049 1181303.7617 1012222.6601 982521.7678

02.3.2 PCA Visualization

Perform PCA on the cleaned protein intensities to visualize sample clustering and separation between hypoxia and normoxia conditions.

# PCA
scaler = "zscore"
complete_data = protein_all.fillna(0)
scaled_data = utils.scale_the_data(
    complete_data,
    method=scaler,
    axis=1,
    is_log=False
)

# Run PCA
pca = PCA()
pca_res = pca.fit_transform(scaled_data.T)
pca_exp_var = pca.explained_variance_ratio_

# Set Variables
C1, C2 = 0, 1
hue_group = "Condition"
style_group = "Duration"
comp_locs = [C1, C2]
comp_names = [f'PC{c+1}' for c in comp_locs]
xlabel = f'{comp_names[0]} ({pca_exp_var[comp_locs[0]]:.2%} of variance)'
ylabel = f'{comp_names[1]} ({pca_exp_var[comp_locs[1]]:.2%} of variance)'

# Build the PCA Plot Data
plot_data = pd.DataFrame(pca_res[:, comp_locs], columns=comp_names)
plot_data['UniqueNames'] = complete_data.columns
plot_data['Condition'] = plot_data['UniqueNames'].map(samples_to_condition)
plot_data['Duration'] = plot_data['UniqueNames'].str.split('_').str[1]

# Find Centroids for each group (Condition and Duration)
centroid_data = plot_data.groupby([hue_group, style_group])[comp_names].mean().reset_index()

# Prepare color/marker mapping for unique (hue, style) combinations
unique_groups = centroid_data[[hue_group, style_group]].drop_duplicates()
group_palette = {}
group_marker = {}
markers = ["o", "s", "D", "^", "v", "P", "X", "*"]
for idx, (cond, dura) in enumerate(unique_groups.values):
    color = condition_colors.get(cond, "#888888")
    marker = markers[idx % len(markers)]
    group_palette[(cond, dura)] = color
    group_marker[(cond, dura)] = marker

fig, ax = plt.subplots(figsize=(9, 7))
fig.patch.set_facecolor('white')

# Plot technical replicates
for (cond, dura), group_df in plot_data.groupby([hue_group, style_group]):
    color = group_palette.get((cond, dura), "#888888")
    marker = group_marker.get((cond, dura), "o")
    ax.scatter(
        group_df[comp_names[0]], group_df[comp_names[1]],
        label=f"{cond}, {dura}hr",
        color=color, marker=marker, s=80, edgecolor='k', alpha=0.85, linewidth=1.2, zorder=2
    )
    # Confidence ellipse
    if len(group_df) > 1:
        plots.confidence_ellipse(
            group_df[comp_names[0]], group_df[comp_names[1]], ax, n_std=2,
            edgecolor=color, facecolor=color, alpha=0.18, linewidth=1.5, zorder=1
        )

# Plot centroids
for (cond, dura), group_df in centroid_data.groupby([hue_group, style_group]):
    color = group_palette.get((cond, dura), "#888888")
    marker = group_marker.get((cond, dura), "o")
    ax.scatter(
        group_df[comp_names[0]], group_df[comp_names[1]],
        color=color, marker=marker, s=260, edgecolor='k', linewidth=2.5, alpha=1, zorder=3
    )

# Axis limits and grid
minVal = plot_data[comp_names].min().min()
maxVal = plot_data[comp_names].max().max()
limit = max(abs(minVal), abs(maxVal))
offset = 0.1 * limit
limit += offset
ax.set_xlim(-limit, limit)
ax.set_ylim(-limit, limit)
ax.grid(True, which="both", linestyle="--", linewidth=0.7, alpha=0.4, color="grey")

# 0 lines
ax.axvline(0, color='grey', linestyle='--', linewidth=1.2, alpha=0.6, zorder=0)
ax.axhline(0, color='grey', linestyle='--', linewidth=1.2, alpha=0.6, zorder=0)

# Labels and title
ax.set_xlabel(xlabel, fontsize=14, fontweight="bold")
ax.set_ylabel(ylabel, fontsize=14, fontweight="bold")
ax.set_title(
    f"PCA of Averaging All Peptides to Protein Intensities (n={complete_data.shape[0]})",
    fontsize=17, fontweight="bold", pad=15, loc="left"
)

# Custom legend: combine color and marker
from matplotlib.lines import Line2D
legend_elements = [
    Line2D(
        [0], [0],
        marker=group_marker[(cond, dura)],
        color='w',
        markerfacecolor=group_palette[(cond, dura)],
        markeredgecolor='k',
        markersize=13,
        linewidth=0,
        label=f"{cond}, {dura}hr"
    )
    for (cond, dura) in unique_groups.values
]
ax.legend(
    handles=legend_elements,
    title="Condition, Duration",
    bbox_to_anchor=(1.02, 1),
    loc="upper left",
    frameon=False,
    fontsize=11,
    title_fontsize=12,
    borderaxespad=0.5
)

sns.despine(ax=ax, left=True, bottom=True)
plt.tight_layout()
plots.finalize_plot(
    fig, show=True, save=save_to_folder, 
    filename='PCA_meanAll_Proteins',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

Trends similar to the top-3 peptide method are observed: hypoxia and normoxia samples separate, but timepoint differences remain less distinct.


02.4 ProteoForge Differential Proteoforms (dPFs)

Build proteoform-level intensities using ProteoForge's differential proteoform (dPF) assignments. Each protein may have multiple dPFs representing distinct regulatory states identified through peptide correlation clustering.

Note: PTM-specific peptides (dPF=-1) are excluded as they represent post-translational modifications rather than quantifiable proteoform abundance.

02.4.1 Calculating Protein Intensities from dPFs - 48hr

Note: The timepoint-specific 48hr and 72hr ProteoForge dPF quantifications are used here to better capture timepoint-specific proteoform changes. Using the combined (Comb) version of the analysis would make the comparison unfair, as it would have peptide data from both timepoints available when clustering peptides together.

dPF_long = pd.read_feather(f"{output_path}test_data_48hr.feather")

# Remove the PTM peptides (dPF=-1) since they don't contribute to the protein quantification
dPF_long = dPF_long[dPF_long['dPF'] != -1].reset_index(drop=True)
dPF_long['Proteoform'] = dPF_long['Protein'] + "_" + dPF_long['dPF'].astype(str)
protein_dpf_48 = dPF_long.pivot_table(
    index='Proteoform',
    columns='Sample',
    values='Intensity',
    aggfunc='mean'  # In case of duplicates, take the mean
)

print(f'Protein-level intensities (Proteoform dPF) shape: {protein_dpf_48.shape}')
print(f"\n--- 🔄 Median Centering Normalization ---")
protein_dpf_48 = normalize.by_median_centering(
    df=protein_dpf_48,
    rescale_to_original_magnitude=True,
    # condition_map={k: v for k, v in condition_to_samples.items() if '48 Hr' in k}, 
    condition_map=None,
    verbose=verbose
)



cv_data = pd.DataFrame(index = protein_dpf_48.index)
# Calculate the coefficient of variation per biological sample
for k, v in condition_to_samples.items():
    if not all(item in protein_dpf_48.columns for item in v):
        continue
    cv_data[k] = utils.cv_numpy(
        protein_dpf_48[v].values,
        axis=1,
        ignore_nan=True,
        format="percent"
    )

plot_data = utils.create_cv_group_plot_data(
    cv_data=cv_data,
    cv_group_palettes=cv_group_palettes,
    id_col='Proteoform'
).dropna(subset=['CV'])
plot_data['Method'] = 'dPFs'
cv_datas.append(plot_data)
Protein-level intensities (Proteoform dPF) shape: (7685, 8)

--- 🔄 Median Centering Normalization ---

Applying Median Centering...
  - Applying global median centering...
    - Rescaling by a global factor of: 16384.00
  - Data after median centering has shape: (7685, 8)
  - Missing values (Before/After): (0/0)

02.4.2 Calculating Protein Intensities from dPFs - 72hr

Load and process 72-hour timepoint dPF data using the same approach as 48hr.

dPF_long = pd.read_feather(f"{output_path}test_data_72hr.feather")

# Remove the PTM peptides (dPF=-1) since they don't contribute to the protein quantification
dPF_long = dPF_long[dPF_long['dPF'] != -1].reset_index(drop=True)
dPF_long['Proteoform'] = dPF_long['Protein'] + "_" + dPF_long['dPF'].astype(str)
protein_dpf_72 = dPF_long.pivot_table(
    index='Proteoform',
    columns='Sample',
    values='Intensity',
    aggfunc='mean'  # In case of duplicates, take the mean
)

print(f'Protein-level intensities (Proteoform dPF) shape: {protein_dpf_72.shape}')
print(f"\n--- 🔄 Median Centering Normalization ---")
protein_dpf_72 = normalize.by_median_centering(
    df=protein_dpf_72,
    rescale_to_original_magnitude=True,
    # condition_map={k: v for k, v in condition_to_samples.items() if '72 Hr' in k}, 
    condition_map=None,
    verbose=verbose
)

cv_data = pd.DataFrame(index = protein_dpf_72.index)
# Calculate the coefficient of variation per biological sample
for k, v in condition_to_samples.items():
    if not all(item in protein_dpf_72.columns for item in v):
        continue
    cv_data[k] = utils.cv_numpy(
        protein_dpf_72[v].values,
        axis=1,
        ignore_nan=True,
        format="percent"
    )

plot_data = utils.create_cv_group_plot_data(
    cv_data=cv_data,
    cv_group_palettes=cv_group_palettes,
    id_col='Proteoform'
).dropna(subset=['CV'])

plot_data['Method'] = 'dPFs'
cv_datas.append(plot_data)
Protein-level intensities (Proteoform dPF) shape: (8299, 6)

--- 🔄 Median Centering Normalization ---

Applying Median Centering...
  - Applying global median centering...
    - Rescaling by a global factor of: 8192.00
  - Data after median centering has shape: (8299, 6)
  - Missing values (Before/After): (0/0)

02.4.3 PCA Visualization

Perform PCA on the cleaned protein intensities to visualize sample clustering and separation between hypoxia and normoxia conditions.

dPF_Proteins = pd.concat([protein_dpf_48, protein_dpf_72], axis=1)
# PCA
scaler = "zscore"
complete_data = dPF_Proteins.fillna(0)
scaled_data = utils.scale_the_data(
    complete_data,
    method=scaler,
    axis=1,
    is_log=False
)

# Run PCA
pca = PCA()
pca_res = pca.fit_transform(scaled_data.T)
pca_exp_var = pca.explained_variance_ratio_

# Set Variables
C1, C2 = 0, 1
hue_group = "Condition"
style_group = "Duration"
comp_locs = [C1, C2]
comp_names = [f'PC{c+1}' for c in comp_locs]
xlabel = f'{comp_names[0]} ({pca_exp_var[comp_locs[0]]:.2%} of variance)'
ylabel = f'{comp_names[1]} ({pca_exp_var[comp_locs[1]]:.2%} of variance)'

# Build the PCA Plot Data
plot_data = pd.DataFrame(pca_res[:, comp_locs], columns=comp_names)
plot_data['UniqueNames'] = complete_data.columns
plot_data['Condition'] = plot_data['UniqueNames'].map(samples_to_condition)
plot_data['Duration'] = plot_data['UniqueNames'].str.split('_').str[1]

# Find Centroids for each group (Condition and Duration)
centroid_data = plot_data.groupby([hue_group, style_group])[comp_names].mean().reset_index()

# Prepare color/marker mapping for unique (hue, style) combinations
unique_groups = centroid_data[[hue_group, style_group]].drop_duplicates()
group_palette = {}
group_marker = {}
markers = ["o", "s", "D", "^", "v", "P", "X", "*"]
for idx, (cond, dura) in enumerate(unique_groups.values):
    color = condition_colors.get(cond, "#888888")
    marker = markers[idx % len(markers)]
    group_palette[(cond, dura)] = color
    group_marker[(cond, dura)] = marker

fig, ax = plt.subplots(figsize=(9, 7))
fig.patch.set_facecolor('white')

# Plot technical replicates
for (cond, dura), group_df in plot_data.groupby([hue_group, style_group]):
    color = group_palette.get((cond, dura), "#888888")
    marker = group_marker.get((cond, dura), "o")
    ax.scatter(
        group_df[comp_names[0]], group_df[comp_names[1]],
        label=f"{cond}, {dura}hr",
        color=color, marker=marker, s=80, edgecolor='k', alpha=0.85, linewidth=1.2, zorder=2
    )
    # Confidence ellipse
    if len(group_df) > 1:
        plots.confidence_ellipse(
            group_df[comp_names[0]], group_df[comp_names[1]], ax, n_std=2.5,
            edgecolor=color, facecolor=color, alpha=0.18, linewidth=1.5, zorder=1
        )

# Plot centroids
for (cond, dura), group_df in centroid_data.groupby([hue_group, style_group]):
    color = group_palette.get((cond, dura), "#888888")
    marker = group_marker.get((cond, dura), "o")
    ax.scatter(
        group_df[comp_names[0]], group_df[comp_names[1]],
        color=color, marker=marker, s=260, edgecolor='k', linewidth=2.5, alpha=1, zorder=3
    )

# Axis limits and grid
minVal = plot_data[comp_names].min().min()
maxVal = plot_data[comp_names].max().max()
limit = max(abs(minVal), abs(maxVal))
# offset = 0.10 * limit
# limit += offset
# ax.set_xlim(-limit, limit)
# ax.set_ylim(-limit, limit)
ax.grid(True, which="both", linestyle="--", linewidth=0.7, alpha=0.4, color="grey")

# 0 lines
ax.axvline(0, color='grey', linestyle='--', linewidth=1.2, alpha=0.6, zorder=0)
ax.axhline(0, color='grey', linestyle='--', linewidth=1.2, alpha=0.6, zorder=0)

# Labels and title
ax.set_xlabel(xlabel, fontsize=14, fontweight="bold")
ax.set_ylabel(ylabel, fontsize=14, fontweight="bold")
ax.set_title(
    f"PCA of Averaging dPF Peptides (n={complete_data.shape[0]})",
    fontsize=17, fontweight="bold", pad=15, loc="left"
)

# Custom legend: combine color and marker
from matplotlib.lines import Line2D
legend_elements = [
    Line2D(
        [0], [0],
        marker=group_marker[(cond, dura)],
        color='w',
        markerfacecolor=group_palette[(cond, dura)],
        markeredgecolor='k',
        markersize=13,
        linewidth=0,
        label=f"{cond}, {dura}hr"
    )
    for (cond, dura) in unique_groups.values
]
ax.legend(
    handles=legend_elements,
    title="Condition, Duration",
    bbox_to_anchor=(1.02, 1),
    loc="upper left",
    frameon=False,
    fontsize=11,
    title_fontsize=12,
    borderaxespad=0.5
)

sns.despine(ax=ax, left=True, bottom=True)
plt.tight_layout()
plots.finalize_plot(
    fig, show=True, save=save_to_folder, 
    filename='PCA_dPF_Proteins',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

With peptides assigned to proteoforms and quantified at proteoform-level resolution, the sample separation and structure become markedly more granular and distinct. Both hypoxia and normoxia samples separate well, with clearer distinctions between timepoints than in the previous methods.


02.5 Cumulative PCA Variance Across Methods

Here the cumulative PCA variance curves are computed for each protein quantification method, and the number of PCs required to reach 80% and 90% variance is compared alongside the PC1/PC2 separations and loadings. This analysis shows whether dPFs concentrate biologically relevant variance into fewer components or instead increase dimensionality by revealing proteoform-specific variability, guiding method selection.

variance_data = []
proteinDatasets = {
    "Original": original_protein,
    "mean(top)": protein_top3,
    "mean(all)": protein_all,
    "dPFs": dPF_Proteins
}
for k, v in proteinDatasets.items():
    # Perform PCA on the z-scored data
    complete_data = v  # missing values are filled with 0 after scaling
    # Scale the data
    scaled_data = utils.scale_the_data(
        complete_data,
        method='zscore',
        axis=1,
        is_log=False
    ).fillna(0)
    # Run the PCA
    pca = PCA()
    pca_res = pca.fit_transform(scaled_data.T)
    pca_exp_var = pca.explained_variance_ratio_
    pca_exp_var_cum = np.cumsum(pca_exp_var)
    for i, cum_var in enumerate(pca_exp_var_cum):
        variance_data.append({
            "Data": k,
            "PC Component": i + 1,
            "Cumulative Variance": cum_var
        })

plot_data = pd.DataFrame(variance_data)

# Improved cumulative explained variance plot for PCA

import matplotlib.ticker as mtick

fig, ax = plt.subplots(figsize=(7, 6))

# Robust lineplot with markers and edgecolor for visibility
sns.lineplot(
    data=plot_data,
    x="PC Component",
    y="Cumulative Variance",
    hue="Data",
    style="Data",
    palette=proteinMethods,
    dashes=False,
    linewidth=2.5,
    markers=True,
    markersize=9,
    alpha=0.85,
    ax=ax,
    legend="full",
    markeredgecolor="grey"  # Add edgecolor for marker visibility
)

# Add horizontal reference lines for key variance thresholds
for thresh, color in zip([0.8, 0.9, 0.95], ['#e9c46a', '#f4a261', '#0077b6']):
    ax.axhline(thresh, color=color, linestyle='--', linewidth=1.2, alpha=0.5, zorder=0)
    ax.text(
        plot_data["PC Component"].max() + 0.2, thresh, f"{int(thresh*100)}%",
        va='center', ha='left', color=color, fontsize=11, fontweight='bold', alpha=0.7
    )

# Annotate the number of PCs needed to reach 80% and 90% variance for each method
for method in plot_data['Data'].unique():
    method_data = plot_data[plot_data['Data'] == method]
    for thresh in [0.8, 0.9]:
        pcs = method_data[method_data['Cumulative Variance'] >= thresh]['PC Component']
        if not pcs.empty:
            pc_num = pcs.iloc[0]
            var_val = method_data[method_data['PC Component'] == pc_num]['Cumulative Variance'].values[0]
            ax.scatter(pc_num, var_val, color=proteinMethods[method], s=80, edgecolor='k', zorder=5)
            ax.text(
                pc_num, var_val + 0.03, f"{pc_num} PCs", 
                ha='center', va='bottom', fontsize=10, fontweight='semibold', color=proteinMethods[method]
            )

# Set axis limits and ticks
ax.set_ylim(-0.05, 1.05)
ax.set_xlim(1, plot_data["PC Component"].max())
ax.set_xticks(range(1, plot_data["PC Component"].max() + 1))
ax.set_xticklabels(
    [str(i) for i in range(1, plot_data["PC Component"].max() + 1)],
    rotation=0, ha="center", fontsize=11, fontweight="bold"
)

# Format y-axis as percent
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, decimals=0))

# Axis labels and title
ax.set_xlabel("Principal Component Number", fontsize=14, fontweight="bold", labelpad=10)
ax.set_ylabel("Cumulative Explained Variance (%)", fontsize=14, fontweight="bold", labelpad=10)
ax.set_title("Cumulative Explained Variance by PCA\nfor Different Protein Quantification Methods", fontsize=16, fontweight="bold", pad=15)

# Grid and background
ax.grid(True, which="both", linestyle="--", linewidth=0.7, alpha=0.5, color="lightgrey")
ax.set_axisbelow(True)
ax.set_facecolor("#fafafa")

# Legend styling
handles, labels = ax.get_legend_handles_labels()
ax.legend(
    handles,
    labels,
    title="Protein Quant.",
    fontsize=12,
    title_fontsize=13,
    loc='center left',
    bbox_to_anchor=(1.01, 0.5),
    frameon=False
)

# Remove top/right spines for a cleaner look
sns.despine(ax=ax, top=True, right=True, left=False, bottom=False)

plt.tight_layout()
plots.finalize_plot(
    fig, show=True, save=save_to_folder, 
    filename='PCA_CumulativeVariance_Proteins',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

While the Original, mean(top), and mean(all) methods show similar variance accumulation patterns, the dPFs method captures variance more efficiently, reaching 80%-90% explained variance with fewer components (2-3). This suggests that proteoform-aware quantification concentrates the biological signal, making the underlying structure simpler to interpret and yielding the clearer separations seen in the PCA plots.


03. QuEStVar Statistical Analysis

QuEStVar (Quantitative Equivalence and Statistical Variance) testing provides a framework for classifying proteins into five categories based on combined equivalence and differential expression analyses:

  • Upregulated: Significantly higher in hypoxia vs normoxia
  • Downregulated: Significantly lower in hypoxia vs normoxia
  • Equivalent: Statistically equivalent between conditions (within equivalence boundary)
  • Unexplained: Insufficient evidence for classification
  • Excluded: Not tested due to failed quality filters (high CV, low quantification)

This analysis compares how different protein quantification methods affect statistical outcomes.

QuEStVar incorporates equivalence testing alongside traditional differential expression analysis to provide a more nuanced classification of protein regulation between conditions. Source: GitHub, Manuscript
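To make the five-category logic concrete, here is a simplified, illustrative sketch combining a Welch t-test for difference with a TOST equivalence test. This is not the QuEStVar implementation: the `classify_protein` helper, the `eq_bound` and `alpha` thresholds, and the omission of the quality-filter (Excluded) step are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

def classify_protein(group_a, group_b, eq_bound=0.5, alpha=0.05):
    """Toy classification of one protein from two replicate groups of
    log2 intensities (quality filtering / 'Excluded' handled upstream)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    diff = a.mean() - b.mean()

    # Difference test: Welch t-test for unequal variances
    _, p_diff = stats.ttest_ind(a, b, equal_var=False)

    # TOST equivalence: both one-sided tests must reject
    # H0: diff <= -eq_bound  (shift a up by eq_bound, test 'greater')
    _, p_lower = stats.ttest_ind(a + eq_bound, b, equal_var=False,
                                 alternative="greater")
    # H0: diff >= +eq_bound  (shift a down by eq_bound, test 'less')
    _, p_upper = stats.ttest_ind(a - eq_bound, b, equal_var=False,
                                 alternative="less")
    p_eq = max(p_lower, p_upper)

    if p_diff < alpha and abs(diff) > eq_bound:
        return "Upregulated" if diff > 0 else "Downregulated"
    if p_eq < alpha:
        return "Equivalent"
    return "Unexplained"
```

A protein with a clear shift between conditions would come out as Up- or Downregulated, a protein with means well inside the equivalence boundary as Equivalent, and anything in between as Unexplained.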

03.1 Coefficient of Variation Assessment

First, the CV across biological replicates is calculated for each protein quantification method to assess measurement reproducibility prior to statistical testing.
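For reference, the percent CV follows the standard definition below; `utils.cv_numpy` is assumed to implement something equivalent (this `percent_cv` helper and its `ddof=1` choice are illustrative, not the notebook's function):

```python
import numpy as np

def percent_cv(values, axis=1):
    """Coefficient of variation (%) = 100 * sd / mean, ignoring NaNs.
    Uses the sample standard deviation (ddof=1) across replicates."""
    arr = np.asarray(values, dtype=float)
    return 100.0 * np.nanstd(arr, axis=axis, ddof=1) / np.nanmean(arr, axis=axis)

# One protein measured in four replicates (made-up intensities)
row = np.array([[100.0, 110.0, 90.0, 100.0]])
cv = percent_cv(row)  # sd ≈ 8.16 around a mean of 100 -> CV ≈ 8.16%
```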

plot_data = pd.concat(cv_datas, ignore_index=True)

# Create the plot figure and axes
fig, ax = plt.subplots(figsize=(10, 7))

# Draw the bar plot with enhanced styling
bar = sns.barplot(
    data=plot_data,
    x="Condition",
    y="CV",
    hue="Method",
    ax=ax,
    estimator=np.nanmedian,
    errorbar=('ci', 95),
    palette=proteinMethods,
    alpha=0.9,
    edgecolor="black",
    linewidth=1.5,
    width=0.8
)

# --- Annotation and Styling ---

# Add median value labels inside each bar, towards the top
for container in ax.containers:
    labels = [f"{v.get_height():.1f}" if v.get_height() > 0 else '' for v in container]
    ax.bar_label(
        container,
        labels=labels,
        padding=-25, # Negative padding moves label inside the bar. Adjust as needed.
        fontsize=15,
        color='black',
        fontweight='semibold',
        bbox={"boxstyle": "round,pad=0.3", "facecolor": "white", "edgecolor": "none", "alpha": 0.5}
    )



# --- Grid and Axis Styling ---

# Add subtle horizontal gridlines for better readability
ax.yaxis.grid(True, linestyle='--', linewidth=0.6, color='gray', alpha=0.4)
ax.set_axisbelow(True)

# Remove top and right spines for a cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_linewidth(1.2)
ax.spines['bottom'].set_linewidth(1.2)


# --- Labels, Title, and Ticks ---

# Set informative labels and title with adjusted font properties
ax.set_xlabel("Condition", fontsize=14, fontweight='bold', labelpad=15)
ax.set_ylabel("Median Coefficient of Variation (CV %)", fontsize=14, fontweight='bold', labelpad=15)
ax.set_title("Comparison of Protein-Level CVs by Method and Condition", fontsize=18, fontweight='bold', pad=20)

# Customize tick parameters for a polished look
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)
ax.set_ylim(0, ax.get_ylim()[1] * 1.15) # Add some padding to the top


# --- Legend ---

# Place the legend outside the plot for better clarity
handles, labels = ax.get_legend_handles_labels()
ax.legend(
    handles,
    labels,
    title="Method",
    fontsize=12,
    title_fontsize=13,
    loc='upper left',
    bbox_to_anchor=(1, 1), # Position legend outside
    frameon=False # Remove legend frame
)


# --- Final Touches ---

# Ensure everything fits nicely within the figure
plt.tight_layout()
plots.finalize_plot(
    fig, show=True, save=save_to_folder, 
    filename='Protein_CV_Barplot',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

# --- Overall Median Calculation (Unchanged) ---
median_cvs = plot_data.groupby('Method')['CV'].median().sort_values()
print("Overall Median CV per Method:")
for method, median_cv in median_cvs.items():
    print(f" - {method}: {median_cv:.2f}%")
Overall Median CV per Method:
 - mean(all): 6.89%
 - dPFs: 7.27%
 - mean(top): 8.55%
 - Original: 8.79%

ProteoForge doesn't explicitly address the reproducibility of protein quantifications across replicates. Surprisingly, the best (lowest) median CVs are observed when averaging all peptides, followed by dPFs, then the top-3 peptides, with the original DIA-NN protein quantifications showing the highest CVs.

This suggests that proteoform-aware quantification improves reproducibility compared to the original output, although averaging all peptides still yields the lowest variability across replicates.

One would expect the original method to have the best reproducibility, given the advanced normalization and interference-correction steps that DIA-NN applies at the protein level. I don't have a clear explanation for this, as the original study's methods do not state whether any additional protein-level normalization or interference correction was applied. This result suggests that the normalization and quantification strategies at the protein level warrant further investigation.


03.2 Power Analysis for Equivalence Boundary

Determine the optimal equivalence boundary (symmetric log2FC threshold) for equivalence testing, based on the observed CVs and the desired statistical power.

This is an important step to ensure the equivalence testing is appropriately powered given the variability in the data. Determining the minimum meaningful equivalence boundary also establishes the smallest log2FC difference that can be reliably detected given the data quality.
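
The `tests.skewed_distribution` call below belongs to the project's own code, which is not shown here. As a hedged illustration only, a right-skewed CV distribution with the requested mean can be generated from a rescaled gamma draw (an assumption about the implementation, consistent with the `cv_k`/`cv_theta` parameter names):

```python
import numpy as np

def skewed_cv_distribution(mu, k, theta, size, seed=0):
    """Right-skewed CV sample centred on `mu`, sketched as a rescaled gamma draw.

    A gamma(k, theta) draw has mean k * theta; multiplying by mu / (k * theta)
    shifts the mean to the requested CV while keeping the right skew.
    """
    rng = np.random.default_rng(seed)
    return rng.gamma(shape=k, scale=theta, size=size) * (mu / (k * theta))

cvs = skewed_cv_distribution(mu=8.0, k=2, theta=0.5, size=1000)
```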

# Search for the optimal equivalence boundary
## Parameter to check
eqBoundary = None
eq_boundaries = np.linspace(0.1, 1.0, 10)

## Essential parameters
power = 0.8
nRep = 5
cvMean = 8.0 # Mean from all..
## Secondary parameters
nPrts = 1000
pThr = 0.05
corr = 'fdr'
dfThr = 1.0
cvThr = 10**5
nRepeats = 1000
nCores = None       # Use all available cores
# Simulated data parameters
int_mu = 18
int_sd = 1
int_log2 = True
cv_k = 2
cv_theta = 0.5

# Establish the Adjusted SEI
true_SEI = 1
adjs_SEI = 1 - cvMean/100

# Generate the CV distribution from the mean
cv_dist = tests.skewed_distribution(
    mu = cvMean,
    k = cv_k,
    theta = cv_theta,
    size = nPrts
) / 100

cv_data = pd.DataFrame({
    "cvMean": cvMean,
    "cvDist": cv_dist
})

iterations = []
for eqThr in eq_boundaries:
    for i in range(nRepeats):
        iterations.append((
            cvMean, eqThr, nPrts, nRep, pThr, dfThr, cvThr, corr, i, int_mu, int_sd, int_log2, cv_k, cv_theta
        ))
print(f"Total iterations: {len(iterations)}")

# Run the simulation
results_df = tests.multiprocess_simulation(
    iterations = iterations,
    nCores = nCores  # defined above; None uses all available cores
)
# Calculate the power from SEI
results_df["calc_SEI"] = results_df["calc_SEI"].astype(float)
results_df["calc_power"] = results_df["calc_SEI"].apply(
    lambda x: tests.calculate_power(
        simulated_sei=x,
        target_sei=adjs_SEI
    )
)

# Print the results
tests.print_power_analysis_results(
    results_df = results_df,
    power = power,
    eqBoundaries = eq_boundaries,
    nRep = nRep,
    cvMean = cvMean,
    nPrts = nPrts,
    pThr = pThr,
    corr = corr,
    # dfThr = dfThr,
    # cvThr = cvThr,
    nRepeat = nRepeats
)
# Visualize the power profile
plots.single_variable_power_profile(
    results_df,
    x_axis_variable="eqThr",
    y_axis_variable = "calc_power",
    target_power = power,
    figtitle="Power profiling for equivalence boundaries",
    xlabel="Symmetrical Equivalence Boundary",
    save=save_to_folder,
    show=True,
    filepath=figure_path,
    filename="powerProfile_eqThr_linePlot",
    fileformats=figure_formats,
)
Total iterations: 10000

Power Analysis Results:
-----------------------

Input Parameters:
  - Target power: 0.80
  - Sample size (rep) per Group: 5
  - Mean Intra-sample CV: 8.00%
  - Number of Proteins: 1000
  - Significance threshold (p-value): 0.05
  - P-value correction method: fdr
  - Simulation Repeats: 1000

Simulation Results:
Symmetrical equivalence boundaries tested = [0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
  - The minimum symmetrical equivalence boundary for target: 0.30

From the power analysis, an equivalence boundary of ±0.25 log2FC provides sufficient power (80%) to detect equivalence for the majority of proteins across all quantification methods. This threshold balances sensitivity and specificity when classifying proteins as equivalent versus differentially expressed.

Note: The power analysis determines the smallest effect size (equivalence boundary) that can be reliably detected given the data variability and sample size. Ideally, this step is done at the design stage, prior to any statistical testing, to ensure appropriate interpretation of equivalence results; because existing data are used here, it is more of a post-hoc assessment.
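
QuEStVar's internals are not reproduced here, but the role of the equivalence boundary can be illustrated with a generic two one-sided tests (TOST) sketch on log2 intensities (an assumed form using scipy, not the package's actual implementation; `eq_thr` mirrors the boundary above):

```python
import numpy as np
from scipy import stats

def tost_equivalence(x, y, eq_thr=0.25, equal_var=False):
    """Generic TOST: equivalence when the mean log2 difference lies
    significantly inside (-eq_thr, +eq_thr)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # H0a: diff <= -eq_thr  vs  H1a: diff > -eq_thr
    p_lo = stats.ttest_ind(x, y - eq_thr, equal_var=equal_var,
                           alternative="greater").pvalue
    # H0b: diff >= +eq_thr  vs  H1b: diff < +eq_thr
    p_hi = stats.ttest_ind(x, y + eq_thr, equal_var=equal_var,
                           alternative="less").pvalue
    return max(p_lo, p_hi)  # rejecting both one-sided nulls implies equivalence

rng = np.random.default_rng(1)
p_same = tost_equivalence(rng.normal(18, 0.05, 5), rng.normal(18, 0.05, 5))
p_diff = tost_equivalence(rng.normal(18, 0.05, 5), rng.normal(19, 0.05, 5))
```

With tight replicate noise, the near-identical groups fall inside the boundary (small TOST p-value), while a 1 log2FC shift clearly does not.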

03.3 Number of Proteins per Method

signf_protein_dict = {
    "48hr":{},
    "72hr":{}
}
# Build protein count data for all methods and durations (48hr and 72hr)
protein_count_data = pd.DataFrame({
    'Method': (
        ['Original', 'Original', 'mean(top)', 'mean(top)', 'mean(all)', 'mean(all)', 'dPFs', 'dPFs']
    ),
    'Duration': (
        ['48hr', '72hr', '48hr', '72hr',  '48hr', '72hr', '48hr', '72hr']
    ),
    'Count Type': ['All Proteins'] * 8,
    'Count': [
        original_protein.shape[0],  # Original 48hr
        original_protein.shape[0],  # Original 72hr (same shape, as index is all samples)
        protein_top3.shape[0],      # mean(top) 48hr
        protein_top3.shape[0],      # mean(top) 72hr
        protein_all.shape[0],       # mean(all) 48hr
        protein_all.shape[0],       # mean(all) 72hr
        protein_dpf_48.shape[0],    # dPFs 48hr
        protein_dpf_72.shape[0],    # dPFs 72hr
    ]
})
protein_count_data
Method Duration Count Type Count
0 Original 48hr All Proteins 9029
1 Original 72hr All Proteins 9029
2 mean(top) 48hr All Proteins 7161
3 mean(top) 72hr All Proteins 7161
4 mean(all) 48hr All Proteins 7161
5 mean(all) 72hr All Proteins 7161
6 dPFs 48hr All Proteins 7685
7 dPFs 72hr All Proteins 8299

This is a simple look at the number of proteins available to each method. The mean(top) and mean(all) methods share the same count (7,161) because both are computed from the same peptide-level data. The Original method retains many more proteins (9,029) because more proteins survive its cleaning steps. The dPFs method yields a different count at each timepoint because the clustering is performed separately per timepoint: 7,685 proteins at 48hr versus 8,299 at 72hr, showing that the number of proteins varies with the available dPFs.
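
For a side-by-side view, the long-format counts above can be pivoted, one row per method and one column per timepoint (the frame here is rebuilt for illustration, with values copied from the table output):

```python
import pandas as pd

# Rebuild of the count table printed above (long format)
protein_count_data = pd.DataFrame({
    "Method": ["Original", "Original", "mean(top)", "mean(top)",
               "mean(all)", "mean(all)", "dPFs", "dPFs"],
    "Duration": ["48hr", "72hr"] * 4,
    "Count": [9029, 9029, 7161, 7161, 7161, 7161, 7685, 8299],
})

# One row per method, one column per timepoint
wide = protein_count_data.pivot(index="Method", columns="Duration", values="Count")
```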


03.4 QuEStVar Analysis - 48hr Timepoint

Apply QuEStVar testing to compare True Hypoxia (1% O₂) vs Normoxia (21% O₂) at the 48-hour timepoint using all four protein quantification methods.

Note: The same statistical parameters are used for all methods and timepoints to ensure comparability. [eq_thr=0.25, df_thr=0.5, p_thr=0.05, corr='fdr', equal_var=False, cv_thr=1 (100% CV cutoff)]

03.4.1 Original Protein Intensities

p_thr = 0.05
df_thr = 0.5
eq_thr = 0.25  # From the power analysis above
cv_thr = 1  # > 100% CV will be excluded
equal_var = False 
correction = 'fdr'  
cond_1 = "True Hypoxia (1% O2) - 48 Hr"
cond_2 = "Normoxia (21% O2) - 48 Hr"

print(" RUNNING QUESTVAR STATISTICAL ANALYSIS")
print("=" * 80)
print(f"Parameters: p_thr={p_thr}, cv_thr={cv_thr}, df_thr={df_thr}, eq_thr={eq_thr}")
print(f"Correction: {correction.upper()}, Conditions: {cond_1} vs {cond_2}, Equal Variance: {equal_var}")

res, info = tests.run_questvar(
    S1_arr=original_protein[condition_to_samples[cond_1]].values,
    S2_arr=original_protein[condition_to_samples[cond_2]].values,
    is_log2=False,
    cv_thr=cv_thr,
    p_thr=p_thr,
    df_thr=df_thr,
    eq_thr=eq_thr,
    var_equal=equal_var,
    is_paired=False,
    correction=correction,
    allow_missing=True,
)

info = info.set_index( original_protein.index ).reset_index().set_index('Entry')

info['Status'] = info['Status'].replace({
    0.0: 'Unexplained',
    1.0: 'Equivalent',
    -1.0: 'Different',
}).fillna('Excluded')

quest = info.merge(
    res.iloc[:, :-1], 
    left_index=True, 
    right_index=True, 
    how='left'
)
quest['-log10(Fdr)'] = -np.log10(
    quest['df_adjp']
).replace([np.inf, -np.inf], np.nan)

quest['Status'] = np.where(
    quest['Status'] == 'Different',
    np.where(quest['log2FC'] > 0, 'Upregulated', 'Downregulated'),
    quest['Status']
)

id_col='Protein'
# Find the fasta info and map it, ensuring only one id_col is kept
cols = ['Gene', 'Description', 'Length', 'Weight(kDa)']#, 'Category', 'SPC']
qcols = [col for col in quest.columns if col != id_col]
quest = quest.merge(
    fasta_data[[id_col] + cols],
    left_on=id_col,
    right_on=id_col,
    how='left'
)
# Order the columns, keeping only one id_col
quest = quest[[id_col] + cols + qcols]

print()
print("📊 QUESTVAR ANALYSIS RESULTS")
print("-" * 60)

status_counts = quest['Status'].value_counts()
total_proteins = len(quest)

print(f"Total proteins analyzed: {total_proteins:,}")
print(f"Results breakdown:")
for status, count in status_counts.items():
    percentage = (count/total_proteins)*100
    print(f"   • {status}: {count:,} ({percentage:.1f}%)")
print()

# Add "Equivalent", "Upregulated", "Downregulated" to the dictionary
signf_protein_dict['48hr']["Original"] = {
    'Upregulated': quest.loc[quest['Status'] == 'Upregulated', 'Protein'].unique(),
    'Downregulated': quest.loc[quest['Status'] == 'Downregulated', 'Protein'].unique(),
    'Equivalent': quest.loc[quest['Status'] == 'Equivalent', 'Protein'].unique()
}
# Add these status counts to the protein_count_data DataFrame
protein_count_data = pd.concat(
    [
        protein_count_data,
        pd.DataFrame({
            'Method': ['Original'] * 5,
            'Duration': ['48hr'] * 5,
            'Count Type': status_counts.index.tolist(),
            'Count': status_counts.values.tolist()
        })
    ],
    axis=0,
    ignore_index=True
)

# Save results
output_data = f"{output_path}OriginalProtein_48hr_QuEStVar.feather"
print(f"💾 SAVING RESULTS to {output_data}")
print("-" * 80)
quest.to_feather(f"{output_data}")
print()

# Generate main summary plot
print("📈 GENERATING QUESTVAR SUMMARY PLOT")
print("-" * 50)

plots.questvar_test_summary(
    quest_data=quest,
    p_thr=p_thr,
    eq_thr=eq_thr,
    df_thr=df_thr,
    cv_thr=cv_thr,
    correction=correction,
    cond_1=cond_1,
    cond_2=cond_2,
    figsize=(20, 15),
    title_add=f"Original Protein Data: {cond_1} vs {cond_2}",
    show=True,
    save=save_to_folder,
    filename=f"QuEStVar_Summary_OriginalProtein_48hr_Hypoxia_vs_Normoxia",
    fileformats=figure_formats,
    filepath=figure_path,
    transparent=transparent_bg,
    dpi=figure_dpi
)
 RUNNING QUESTVAR STATISTICAL ANALYSIS
================================================================================
Parameters: p_thr=0.05, cv_thr=1, df_thr=0.5, eq_thr=0.25
Correction: FDR, Conditions: True Hypoxia (1% O2) - 48 Hr vs Normoxia (21% O2) - 48 Hr, Equal Variance: False

📊 QUESTVAR ANALYSIS RESULTS
------------------------------------------------------------
Total proteins analyzed: 9,029
Results breakdown:
   • Unexplained: 6,889 (76.3%)
   • Equivalent: 1,892 (21.0%)
   • Excluded: 164 (1.8%)
   • Upregulated: 65 (0.7%)
   • Downregulated: 19 (0.2%)

💾 SAVING RESULTS to ./data/results/hypoxia/OriginalProtein_48hr_QuEStVar.feather
--------------------------------------------------------------------------------

📈 GENERATING QUESTVAR SUMMARY PLOT
--------------------------------------------------

These figures are my summary plots for QuEStVar; they give an overview of the test results, with the panel descriptions and method details included in the figure itself. A couple of things stand out: in the Original 48hr comparison of hypoxia vs normoxia, the majority of proteins that receive a statistical call are classified as equivalent, with a smaller number of upregulated and downregulated proteins.


03.4.2 Top-3 Averaged Protein Intensities

cond_1 = "True Hypoxia (1% O2) - 48 Hr"
cond_2 = "Normoxia (21% O2) - 48 Hr"

print(" RUNNING QUESTVAR STATISTICAL ANALYSIS")
print("=" * 80)
print(f"Parameters: p_thr={p_thr}, cv_thr={cv_thr}, df_thr={df_thr}, eq_thr={eq_thr}")
print(f"Correction: {correction.upper()}, Conditions: {cond_1} vs {cond_2}, Equal Variance: {equal_var}")

res, info = tests.run_questvar(
    S1_arr=protein_top3[condition_to_samples[cond_1]].values,
    S2_arr=protein_top3[condition_to_samples[cond_2]].values,
    is_log2=False,
    cv_thr=cv_thr,
    p_thr=p_thr,
    df_thr=df_thr,
    eq_thr=eq_thr,
    var_equal=equal_var,
    is_paired=False,
    correction=correction,
    allow_missing=True,
)

info = info.set_index( protein_top3.index ).reset_index().set_index('Entry')

info['Status'] = info['Status'].replace({
    0.0: 'Unexplained',
    1.0: 'Equivalent',
    -1.0: 'Different',
}).fillna('Excluded')

quest = info.merge(
    res.iloc[:, :-1], 
    left_index=True, 
    right_index=True, 
    how='left'
)
quest['-log10(Fdr)'] = -np.log10(
    quest['df_adjp']
).replace([np.inf, -np.inf], np.nan)

quest['Status'] = np.where(
    quest['Status'] == 'Different',
    np.where(quest['log2FC'] > 0, 'Upregulated', 'Downregulated'),
    quest['Status']
)

id_col='Protein'
# Find the fasta info and map it, ensuring only one id_col is kept
cols = ['Gene', 'Description', 'Length', 'Weight(kDa)']#, 'Category', 'SPC']
qcols = [col for col in quest.columns if col != id_col]
quest = quest.merge(
    fasta_data[[id_col] + cols],
    left_on=id_col,
    right_on=id_col,
    how='left'
)
# Order the columns, keeping only one id_col
quest = quest[[id_col] + cols + qcols]

print()
print("📊 QUESTVAR ANALYSIS RESULTS")
print("-" * 60)

status_counts = quest['Status'].value_counts()
total_proteins = len(quest)

print(f"Total proteins analyzed: {total_proteins:,}")
print(f"Results breakdown:")
for status, count in status_counts.items():
    percentage = (count/total_proteins)*100
    print(f"   • {status}: {count:,} ({percentage:.1f}%)")
print()

# Add "Equivalent", "Upregulated", "Downregulated" to the dictionary
signf_protein_dict['48hr']["mean(top)"] = {
    'Upregulated': quest.loc[quest['Status'] == 'Upregulated', 'Protein'].unique(),
    'Downregulated': quest.loc[quest['Status'] == 'Downregulated', 'Protein'].unique(),
    'Equivalent': quest.loc[quest['Status'] == 'Equivalent', 'Protein'].unique()
}
# Add these status counts to the protein_count_data DataFrame
protein_count_data = pd.concat(
    [
        protein_count_data,
        pd.DataFrame({
            'Method': ['mean(top)'] * 5,
            'Duration': ['48hr'] * 5,
            'Count Type': status_counts.index.tolist(),
            'Count': status_counts.values.tolist()
        })
    ],
    axis=0,
    ignore_index=True
)


# Save results
output_data = f"{output_path}Top3Protein_48hr_QuEStVar.feather"
print(f"💾 SAVING RESULTS to {output_data}")
print("-" * 80)
quest.to_feather(f"{output_data}")
print()

# Generate main summary plot
print("📈 GENERATING QUESTVAR SUMMARY PLOT")
print("-" * 50)

plots.questvar_test_summary(
    quest_data=quest,
    p_thr=p_thr,
    eq_thr=eq_thr,
    df_thr=df_thr,
    cv_thr=cv_thr,
    correction=correction,
    cond_1=cond_1,
    cond_2=cond_2,
    figsize=(20, 15),
    title_add=f"Mean of Top3 Protein Data: {cond_1} vs {cond_2}",
    show=True,
    save=save_to_folder,
    filename=f"QuEStVar_Summary_Top3Protein_48hr_Hypoxia_vs_Normoxia",
    fileformats=figure_formats,
    filepath=figure_path,
    transparent=transparent_bg,
    dpi=figure_dpi
)
 RUNNING QUESTVAR STATISTICAL ANALYSIS
================================================================================
Parameters: p_thr=0.05, cv_thr=1, df_thr=0.5, eq_thr=0.25
Correction: FDR, Conditions: True Hypoxia (1% O2) - 48 Hr vs Normoxia (21% O2) - 48 Hr, Equal Variance: False

📊 QUESTVAR ANALYSIS RESULTS
------------------------------------------------------------
Total proteins analyzed: 7,161
Results breakdown:
   • Unexplained: 5,498 (76.8%)
   • Equivalent: 1,493 (20.8%)
   • Upregulated: 106 (1.5%)
   • Downregulated: 58 (0.8%)
   • Excluded: 6 (0.1%)

💾 SAVING RESULTS to ./data/results/hypoxia/Top3Protein_48hr_QuEStVar.feather
--------------------------------------------------------------------------------

📈 GENERATING QUESTVAR SUMMARY PLOT
--------------------------------------------------

The mean(top) method shows a similar pattern to the Original method, with a large proportion of proteins classified as equivalent. However, there is a slight increase in the number of upregulated and downregulated proteins compared to the original method, suggesting that averaging the top 3 peptides may enhance sensitivity to detect differential expression.


03.4.3 All Peptides Averaged Protein Intensities

cond_1 = "True Hypoxia (1% O2) - 48 Hr"
cond_2 = "Normoxia (21% O2) - 48 Hr"

print(" RUNNING QUESTVAR STATISTICAL ANALYSIS")
print("=" * 80)
print(f"Parameters: p_thr={p_thr}, cv_thr={cv_thr}, df_thr={df_thr}, eq_thr={eq_thr}")
print(f"Correction: {correction.upper()}, Conditions: {cond_1} vs {cond_2}, Equal Variance: {equal_var}")

res, info = tests.run_questvar(
    S1_arr=protein_all[condition_to_samples[cond_1]].values,
    S2_arr=protein_all[condition_to_samples[cond_2]].values,
    is_log2=False,
    cv_thr=cv_thr,
    p_thr=p_thr,
    df_thr=df_thr,
    eq_thr=eq_thr,
    var_equal=equal_var,
    is_paired=False,
    correction=correction,
    allow_missing=True,
)

info = info.set_index( protein_all.index ).reset_index().set_index('Entry')

info['Status'] = info['Status'].replace({
    0.0: 'Unexplained',
    1.0: 'Equivalent',
    -1.0: 'Different',
}).fillna('Excluded')

quest = info.merge(
    res.iloc[:, :-1], 
    left_index=True, 
    right_index=True, 
    how='left'
)
quest['-log10(Fdr)'] = -np.log10(
    quest['df_adjp']
).replace([np.inf, -np.inf], np.nan)

quest['Status'] = np.where(
    quest['Status'] == 'Different',
    np.where(quest['log2FC'] > 0, 'Upregulated', 'Downregulated'),
    quest['Status']
)

id_col='Protein'
# Find the fasta info and map it, ensuring only one id_col is kept
cols = ['Gene', 'Description', 'Length', 'Weight(kDa)']#, 'Category', 'SPC']
qcols = [col for col in quest.columns if col != id_col]
quest = quest.merge(
    fasta_data[[id_col] + cols],
    left_on=id_col,
    right_on=id_col,
    how='left'
)
# Order the columns, keeping only one id_col
quest = quest[[id_col] + cols + qcols]

print()
print("📊 QUESTVAR ANALYSIS RESULTS")
print("-" * 60)

status_counts = quest['Status'].value_counts()
total_proteins = len(quest)

print(f"Total proteins analyzed: {total_proteins:,}")
print(f"Results breakdown:")
for status, count in status_counts.items():
    percentage = (count/total_proteins)*100
    print(f"   • {status}: {count:,} ({percentage:.1f}%)")
print()

# Add "Equivalent", "Upregulated", "Downregulated" to the dictionary
signf_protein_dict['48hr']["mean(all)"] = {
    'Upregulated': quest.loc[quest['Status'] == 'Upregulated', 'Protein'].unique(),
    'Downregulated': quest.loc[quest['Status'] == 'Downregulated', 'Protein'].unique(),
    'Equivalent': quest.loc[quest['Status'] == 'Equivalent', 'Protein'].unique()
}
# Add these status counts to the protein_count_data DataFrame
protein_count_data = pd.concat(
    [
        protein_count_data,
        pd.DataFrame({
            'Method': ['mean(all)'] * 5,
            'Duration': ['48hr'] * 5,
            'Count Type': status_counts.index.tolist(),
            'Count': status_counts.values.tolist()
        })
    ],
    axis=0,
    ignore_index=True
)

# Save results
output_data = f"{output_path}AllProtein_48hr_QuEStVar.feather"
print(f"💾 SAVING RESULTS to {output_data}")
print("-" * 80)
quest.to_feather(f"{output_data}")
print()

# Generate main summary plot
print("📈 GENERATING QUESTVAR SUMMARY PLOT")
print("-" * 50)

plots.questvar_test_summary(
    quest_data=quest,
    p_thr=p_thr,
    eq_thr=eq_thr,
    df_thr=df_thr,
    cv_thr=cv_thr,
    correction=correction,
    cond_1=cond_1,
    cond_2=cond_2,
    figsize=(20, 15),
    title_add=f"Mean of All Peptides - Protein Data: {cond_1} vs {cond_2}",
    show=True,
    save=save_to_folder,
    filename=f"QuEStVar_Summary_AllProtein_48hr_Hypoxia_vs_Normoxia",
    fileformats=figure_formats,
    filepath=figure_path,
    transparent=transparent_bg,
    dpi=figure_dpi
)
 RUNNING QUESTVAR STATISTICAL ANALYSIS
================================================================================
Parameters: p_thr=0.05, cv_thr=1, df_thr=0.5, eq_thr=0.25
Correction: FDR, Conditions: True Hypoxia (1% O2) - 48 Hr vs Normoxia (21% O2) - 48 Hr, Equal Variance: False

📊 QUESTVAR ANALYSIS RESULTS
------------------------------------------------------------
Total proteins analyzed: 7,161
Results breakdown:
   • Unexplained: 4,680 (65.4%)
   • Equivalent: 2,202 (30.7%)
   • Upregulated: 179 (2.5%)
   • Downregulated: 95 (1.3%)
   • Excluded: 5 (0.1%)

💾 SAVING RESULTS to ./data/results/hypoxia/AllProtein_48hr_QuEStVar.feather
--------------------------------------------------------------------------------

📈 GENERATING QUESTVAR SUMMARY PLOT
--------------------------------------------------

Similar trends from the previous two methods are observed here as well, with a large number of proteins classified as equivalent.
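
Since `signf_protein_dict` accumulates the significant protein sets per method, cross-method agreement can later be quantified with plain set operations. A toy sketch (the protein IDs are invented):

```python
# Miniature stand-in for signf_protein_dict['48hr'] with invented IDs
signf = {
    "Original":  {"Upregulated": {"P1", "P2", "P3"}},
    "mean(all)": {"Upregulated": {"P2", "P3", "P4", "P5"}},
}

a = signf["Original"]["Upregulated"]
b = signf["mean(all)"]["Upregulated"]

shared = a & b                     # proteins both methods call upregulated
jaccard = len(a & b) / len(a | b)  # overlap as a fraction of the union
```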


03.4.4 dPF-Based Proteoform Intensities

cond_1 = "True Hypoxia (1% O2) - 48 Hr"
cond_2 = "Normoxia (21% O2) - 48 Hr"

print(" RUNNING QUESTVAR STATISTICAL ANALYSIS")
print("=" * 80)
print(f"Parameters: p_thr={p_thr}, cv_thr={cv_thr}, df_thr={df_thr}, eq_thr={eq_thr}")
print(f"Correction: {correction.upper()}, Conditions: {cond_1} vs {cond_2}, Equal Variance: {equal_var}")

res, info = tests.run_questvar(
    S1_arr=protein_dpf_48[condition_to_samples[cond_1]].values,
    S2_arr=protein_dpf_48[condition_to_samples[cond_2]].values,
    is_log2=False,
    cv_thr=cv_thr,
    p_thr=p_thr,
    df_thr=df_thr,
    eq_thr=eq_thr,
    var_equal=equal_var,
    is_paired=False,
    correction=correction,
    allow_missing=True,
)

info = info.set_index( protein_dpf_48.index ).reset_index().set_index('Entry')

info['Status'] = info['Status'].replace({
    0.0: 'Unexplained',
    1.0: 'Equivalent',
    -1.0: 'Different',
}).fillna('Excluded')

quest = info.merge(
    res.iloc[:, :-1], 
    left_index=True, 
    right_index=True, 
    how='left'
)
quest['-log10(Fdr)'] = -np.log10(
    quest['df_adjp']
).replace([np.inf, -np.inf], np.nan)

quest['Status'] = np.where(
    quest['Status'] == 'Different',
    np.where(quest['log2FC'] > 0, 'Upregulated', 'Downregulated'),
    quest['Status']
)

if 'Proteoform' in quest.columns:
    quest['Protein'] = quest['Proteoform'].str.split('_').str[0]

id_col='Protein'
# Find the fasta info and map it, ensuring only one id_col is kept
cols = ['Gene', 'Description', 'Length', 'Weight(kDa)']#, 'Category', 'SPC']
qcols = [col for col in quest.columns if col != id_col]
quest = quest.merge(
    fasta_data[[id_col] + cols],
    left_on=id_col,
    right_on=id_col,
    how='left'
)

# Order the columns, keeping only one id_col
quest = quest[[id_col] + cols + qcols]

print()
print("📊 QUESTVAR ANALYSIS RESULTS")
print("-" * 60)

status_counts = quest['Status'].value_counts()
total_proteins = len(quest)

print(f"Total proteins analyzed: {total_proteins:,}")
print(f"Results breakdown:")
for status, count in status_counts.items():
    percentage = (count/total_proteins)*100
    print(f"   • {status}: {count:,} ({percentage:.1f}%)")
print()

# Add "Equivalent", "Upregulated", "Downregulated" to the dictionary
signf_protein_dict['48hr']["dPFs"] = {
    'Upregulated': quest.loc[quest['Status'] == 'Upregulated', 'Protein'].unique(),
    'Downregulated': quest.loc[quest['Status'] == 'Downregulated', 'Protein'].unique(),
    'Equivalent': quest.loc[quest['Status'] == 'Equivalent', 'Protein'].unique()
}
# Add these status counts to the protein_count_data DataFrame
protein_count_data = pd.concat(
    [
        protein_count_data,
        pd.DataFrame({
            'Method': ['dPFs'] * 5,
            'Duration': ['48hr'] * 5,
            'Count Type': status_counts.index.tolist(),
            'Count': status_counts.values.tolist()
        })
    ],
    axis=0,
    ignore_index=True
)

# Save results
output_data = f"{output_path}dPFProtein_48hr_QuEStVar.feather"
print(f"💾 SAVING RESULTS to {output_data}")
print("-" * 80)
quest.to_feather(f"{output_data}")
print()

# Generate main summary plot
print("📈 GENERATING QUESTVAR SUMMARY PLOT")
print("-" * 50)

plots.questvar_test_summary(
    quest_data=quest,
    p_thr=p_thr,
    eq_thr=eq_thr,
    df_thr=df_thr,
    cv_thr=cv_thr,
    correction=correction,
    cond_1=cond_1,
    cond_2=cond_2,
    figsize=(20, 15),
    title_add=f"dPFs -  Protein Data: {cond_1} vs {cond_2}",
    show=True,
    save=save_to_folder,
    filename=f"QuEStVar_Summary_dPFProtein_48hr_Hypoxia_vs_Normoxia",
    fileformats=figure_formats,
    filepath=figure_path,
    transparent=transparent_bg,
    dpi=figure_dpi
)
 RUNNING QUESTVAR STATISTICAL ANALYSIS
================================================================================
Parameters: p_thr=0.05, cv_thr=1, df_thr=0.5, eq_thr=0.25
Correction: FDR, Conditions: True Hypoxia (1% O2) - 48 Hr vs Normoxia (21% O2) - 48 Hr, Equal Variance: False

📊 QUESTVAR ANALYSIS RESULTS
------------------------------------------------------------
Total proteins analyzed: 7,685
Results breakdown:
   • Unexplained: 4,657 (60.6%)
   • Equivalent: 2,339 (30.4%)
   • Upregulated: 354 (4.6%)
   • Downregulated: 321 (4.2%)
   • Excluded: 14 (0.2%)

💾 SAVING RESULTS to ./data/results/hypoxia/dPFProtein_48hr_QuEStVar.feather
--------------------------------------------------------------------------------

📈 GENERATING QUESTVAR SUMMARY PLOT
--------------------------------------------------

With the introduction of dPF-based proteoform quantification, there is a noticeable shift in the QuEStVar results: the share of proteins classified as equivalent shifts slightly downward, while the counts of upregulated and downregulated proteins increase markedly. The antler and MA plots show wide "wings" of up/downregulated proteins; these are proteins that were low-value imputed in one of the conditions, which is why they appear this way.

Note: Some might argue that this increase in differential expression detection could be due to increased noise from proteoform-level quantification. However, the distinct separation in PCA plots and improved biological relevance in downstream pathway analyses (not shown here) suggest that the dPF method captures meaningful proteoform-specific regulation rather than just noise.

Note: The addition of imputed low-intensity values in one condition can artificially inflate fold changes, leading to more proteins being classified as up/downregulated. Careful interpretation is needed to distinguish true biological regulation from artifacts introduced by imputation. This is why I implement multi-level imputation rather than imputing everything with the same method: when data are completely missing or >90% missing in a condition, a low-value imputation is more defensible than random forest or kNN imputation. These low-value imputations are what produce the wings in the statistical tests.
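
A toy numeric illustration of these "wings": if a protein is observed only under normoxia and gets floor-imputed under hypoxia, the resulting log2FC is dominated by the imputation floor rather than biology (all values here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
normoxia = rng.normal(loc=18.0, scale=0.3, size=5)         # observed, log2 scale
hypoxia_imputed = rng.normal(loc=12.0, scale=0.3, size=5)  # low-value imputation floor

# Fold change reflects the imputation floor, not a measured difference
log2fc = hypoxia_imputed.mean() - normoxia.mean()
```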


03.5 QuEStVar Analysis - 72hr Timepoint

Apply QuEStVar testing to compare True Hypoxia (1% O₂) vs Normoxia (21% O₂) at the 72-hour timepoint.

03.5.1 Original Protein Intensities

cond_1 = "True Hypoxia (1% O2) - 72 Hr"
cond_2 = "Normoxia (21% O2) - 72 Hr"

print(" RUNNING QUESTVAR STATISTICAL ANALYSIS")
print("=" * 80)
print(f"Parameters: p_thr={p_thr}, cv_thr={cv_thr}, df_thr={df_thr}, eq_thr={eq_thr}")
print(f"Correction: {correction.upper()}, Conditions: {cond_1} vs {cond_2}, Equal Variance: {equal_var}")

res, info = tests.run_questvar(
    S1_arr=original_protein[condition_to_samples[cond_1]].values,
    S2_arr=original_protein[condition_to_samples[cond_2]].values,
    is_log2=False,
    cv_thr=cv_thr,
    p_thr=p_thr,
    df_thr=df_thr,
    eq_thr=eq_thr,
    var_equal=equal_var,
    is_paired=False,
    correction=correction,
    allow_missing=True,
)

info = info.set_index( original_protein.index ).reset_index().set_index('Entry')

info['Status'] = info['Status'].replace({
    0.0: 'Unexplained',
    1.0: 'Equivalent',
    -1.0: 'Different',
}).fillna('Excluded')

quest = info.merge(
    res.iloc[:, :-1], 
    left_index=True, 
    right_index=True, 
    how='left'
)
quest['-log10(Fdr)'] = -np.log10(
    quest['df_adjp']
).replace([np.inf, -np.inf], np.nan)

quest['Status'] = np.where(
    quest['Status'] == 'Different',
    np.where(quest['log2FC'] > 0, 'Upregulated', 'Downregulated'),
    quest['Status']
)

id_col='Protein'
# Find the fasta info and map it, ensuring only one id_col is kept
cols = ['Gene', 'Description', 'Length', 'Weight(kDa)']#, 'Category', 'SPC']
qcols = [col for col in quest.columns if col != id_col]
quest = quest.merge(
    fasta_data[[id_col] + cols],
    left_on=id_col,
    right_on=id_col,
    how='left'
)
# Order the columns, keeping only one id_col
quest = quest[[id_col] + cols + qcols]

print()
print("📊 QUESTVAR ANALYSIS RESULTS")
print("-" * 60)

status_counts = quest['Status'].value_counts()
total_proteins = len(quest)

print(f"Total proteins analyzed: {total_proteins:,}")
print(f"Results breakdown:")
for status, count in status_counts.items():
    percentage = (count/total_proteins)*100
    print(f"   • {status}: {count:,} ({percentage:.1f}%)")
print()

# Add "Equivalent", "Upregulated", "Downregulated" to the dictionary
signf_protein_dict['72hr']["Original"] = {
    'Upregulated': quest.loc[quest['Status'] == 'Upregulated', 'Protein'].unique(),
    'Downregulated': quest.loc[quest['Status'] == 'Downregulated', 'Protein'].unique(),
    'Equivalent': quest.loc[quest['Status'] == 'Equivalent', 'Protein'].unique()
}
# Add these status counts to the protein_count_data DataFrame
protein_count_data = pd.concat(
    [
        protein_count_data,
        pd.DataFrame({
            'Method': ['Original'] * 5,
            'Duration': ['72hr'] * 5,
            'Count Type': status_counts.index.tolist(),
            'Count': status_counts.values.tolist()
        })
    ],
    axis=0,
    ignore_index=True
)

# Save results
output_data = f"{output_path}OriginalProtein_72hr_QuEStVar.feather"
print(f"💾 SAVING RESULTS to {output_data}")
print("-" * 80)
quest.to_feather(f"{output_data}")
print()

# Generate main summary plot
print("📈 GENERATING QUESTVAR SUMMARY PLOT")
print("-" * 50)

plots.questvar_test_summary(
    quest_data=quest,
    p_thr=p_thr,
    eq_thr=eq_thr,
    df_thr=df_thr,
    cv_thr=cv_thr,
    correction=correction,
    cond_1=cond_1,
    cond_2=cond_2,
    figsize=(20, 15),
    title_add=f"Original Protein Data: {cond_1} vs {cond_2}",
    show=True,
    save=save_to_folder,
    filename=f"QuEStVar_Summary_OriginalProtein_72hr_Hypoxia_vs_Normoxia",
    fileformats=figure_formats,
    filepath=figure_path,
    transparent=transparent_bg,
    dpi=figure_dpi
)
 RUNNING QUESTVAR STATISTICAL ANALYSIS
================================================================================
Parameters: p_thr=0.05, cv_thr=1, df_thr=0.5, eq_thr=0.25
Correction: FDR, Conditions: True Hypoxia (1% O2) - 72 Hr vs Normoxia (21% O2) - 72 Hr, Equal Variance: False

📊 QUESTVAR ANALYSIS RESULTS
------------------------------------------------------------
Total proteins analyzed: 9,029
Results breakdown:
   • Unexplained: 7,739 (85.7%)
   • Equivalent: 1,094 (12.1%)
   • Excluded: 157 (1.7%)
   • Upregulated: 23 (0.3%)
   • Downregulated: 16 (0.2%)

💾 SAVING RESULTS to ./data/results/hypoxia/OriginalProtein_72hr_QuEStVar.feather
--------------------------------------------------------------------------------

📈 GENERATING QUESTVAR SUMMARY PLOT
--------------------------------------------------

03.4.2 Top-3 Averaged Protein Intensities

cond_1 = "True Hypoxia (1% O2) - 72 Hr"
cond_2 = "Normoxia (21% O2) - 72 Hr"

print(" RUNNING QUESTVAR STATISTICAL ANALYSIS")
print("=" * 80)
print(f"Parameters: p_thr={p_thr}, cv_thr={cv_thr}, df_thr={df_thr}, eq_thr={eq_thr}")
print(f"Correction: {correction.upper()}, Conditions: {cond_1} vs {cond_2}, Equal Variance: {equal_var}")

res, info = tests.run_questvar(
    S1_arr=protein_top3[condition_to_samples[cond_1]].values,
    S2_arr=protein_top3[condition_to_samples[cond_2]].values,
    is_log2=False,
    cv_thr=cv_thr,
    p_thr=p_thr,
    df_thr=df_thr,
    eq_thr=eq_thr,
    var_equal=equal_var,
    is_paired=False,
    correction=correction,
    allow_missing=True,
)

info = info.set_index(protein_top3.index).reset_index().set_index('Entry')

info['Status'] = info['Status'].replace({
    0.0: 'Unexplained',
    1.0: 'Equivalent',
    -1.0: 'Different',
}).fillna('Excluded')

quest = info.merge(
    res.iloc[:, :-1], 
    left_index=True, 
    right_index=True, 
    how='left'
)
quest['-log10(Fdr)'] = -np.log10(
    quest['df_adjp']
).replace([np.inf, -np.inf], np.nan)

quest['Status'] = np.where(
    quest['Status'] == 'Different',
    np.where(quest['log2FC'] > 0, 'Upregulated', 'Downregulated'),
    quest['Status']
)

id_col='Protein'
# Find the fasta info and map it, ensuring only one id_col is kept
cols = ['Gene', 'Description', 'Length', 'Weight(kDa)']#, 'Category', 'SPC']
qcols = [col for col in quest.columns if col != id_col]
quest = quest.merge(
    fasta_data[[id_col] + cols],
    left_on=id_col,
    right_on=id_col,
    how='left'
)
# Order the columns, keeping only one id_col
quest = quest[[id_col] + cols + qcols]

print()
print("📊 QUESTVAR ANALYSIS RESULTS")
print("-" * 60)

status_counts = quest['Status'].value_counts()
total_proteins = len(quest)

print(f"Total proteins analyzed: {total_proteins:,}")
print(f"Results breakdown:")
for status, count in status_counts.items():
    percentage = (count/total_proteins)*100
    print(f"   • {status}: {count:,} ({percentage:.1f}%)")
print()

# Add "Equivalent", "Upregulated", "Downregulated" to the dictionary
signf_protein_dict['72hr']["mean(top)"] = {
    'Upregulated': quest.loc[quest['Status'] == 'Upregulated', 'Protein'].unique(),
    'Downregulated': quest.loc[quest['Status'] == 'Downregulated', 'Protein'].unique(),
    'Equivalent': quest.loc[quest['Status'] == 'Equivalent', 'Protein'].unique()
}
# Add the status counts to the protein_count_data DataFrame
n_status = len(status_counts)
protein_count_data = pd.concat(
    [
        protein_count_data,
        pd.DataFrame({
            'Method': ['mean(top)'] * n_status,
            'Duration': ['72hr'] * n_status,
            'Count Type': status_counts.index.tolist(),
            'Count': status_counts.values.tolist()
        })
    ],
    axis=0,
    ignore_index=True
)

# Save results
output_data = f"{output_path}Top3Protein_72hr_QuEStVar.feather"
print(f"💾 SAVING RESULTS to {output_data}")
print("-" * 80)
quest.to_feather(f"{output_data}")
print()

# Generate main summary plot
print("📈 GENERATING QUESTVAR SUMMARY PLOT")
print("-" * 50)

plots.questvar_test_summary(
    quest_data=quest,
    p_thr=p_thr,
    eq_thr=eq_thr,
    df_thr=df_thr,
    cv_thr=cv_thr,
    correction=correction,
    cond_1=cond_1,
    cond_2=cond_2,
    figsize=(20, 15),
    title_add=f"Mean of Top3 Protein Data: {cond_1} vs {cond_2}",
    show=True,
    save=save_to_folder,
    filename=f"QuEStVar_Summary_Top3Protein_72hr_Hypoxia_vs_Normoxia",
    fileformats=figure_formats,
    filepath=figure_path,
    transparent=transparent_bg,
    dpi=figure_dpi
)
 RUNNING QUESTVAR STATISTICAL ANALYSIS
================================================================================
Parameters: p_thr=0.05, cv_thr=1, df_thr=0.5, eq_thr=0.25
Correction: FDR, Conditions: True Hypoxia (1% O2) - 72 Hr vs Normoxia (21% O2) - 72 Hr, Equal Variance: False

📊 QUESTVAR ANALYSIS RESULTS
------------------------------------------------------------
Total proteins analyzed: 7,161
Results breakdown:
   • Unexplained: 6,320 (88.3%)
   • Equivalent: 708 (9.9%)
   • Upregulated: 70 (1.0%)
   • Downregulated: 52 (0.7%)
   • Excluded: 11 (0.2%)

💾 SAVING RESULTS to ./data/results/hypoxia/Top3Protein_72hr_QuEStVar.feather
--------------------------------------------------------------------------------

📈 GENERATING QUESTVAR SUMMARY PLOT
--------------------------------------------------

03.4.3 All Peptides Averaged Protein Intensities

cond_1 = "True Hypoxia (1% O2) - 72 Hr"
cond_2 = "Normoxia (21% O2) - 72 Hr"

print(" RUNNING QUESTVAR STATISTICAL ANALYSIS")
print("=" * 80)
print(f"Parameters: p_thr={p_thr}, cv_thr={cv_thr}, df_thr={df_thr}, eq_thr={eq_thr}")
print(f"Correction: {correction.upper()}, Conditions: {cond_1} vs {cond_2}, Equal Variance: {equal_var}")

res, info = tests.run_questvar(
    S1_arr=protein_all[condition_to_samples[cond_1]].values,
    S2_arr=protein_all[condition_to_samples[cond_2]].values,
    is_log2=False,
    cv_thr=cv_thr,
    p_thr=p_thr,
    df_thr=df_thr,
    eq_thr=eq_thr,
    var_equal=equal_var,
    is_paired=False,
    correction=correction,
    allow_missing=True,
)

info = info.set_index(protein_all.index).reset_index().set_index('Entry')

info['Status'] = info['Status'].replace({
    0.0: 'Unexplained',
    1.0: 'Equivalent',
    -1.0: 'Different',
}).fillna('Excluded')

quest = info.merge(
    res.iloc[:, :-1], 
    left_index=True, 
    right_index=True, 
    how='left'
)
quest['-log10(Fdr)'] = -np.log10(
    quest['df_adjp']
).replace([np.inf, -np.inf], np.nan)

quest['Status'] = np.where(
    quest['Status'] == 'Different',
    np.where(quest['log2FC'] > 0, 'Upregulated', 'Downregulated'),
    quest['Status']
)

id_col='Protein'
# Find the fasta info and map it, ensuring only one id_col is kept
cols = ['Gene', 'Description', 'Length', 'Weight(kDa)']#, 'Category', 'SPC']
qcols = [col for col in quest.columns if col != id_col]
quest = quest.merge(
    fasta_data[[id_col] + cols],
    left_on=id_col,
    right_on=id_col,
    how='left'
)
# Order the columns, keeping only one id_col
quest = quest[[id_col] + cols + qcols]

print()
print("📊 QUESTVAR ANALYSIS RESULTS")
print("-" * 60)

status_counts = quest['Status'].value_counts()
total_proteins = len(quest)

print(f"Total proteins analyzed: {total_proteins:,}")
print(f"Results breakdown:")
for status, count in status_counts.items():
    percentage = (count/total_proteins)*100
    print(f"   • {status}: {count:,} ({percentage:.1f}%)")
print()

# Add "Equivalent", "Upregulated", "Downregulated" to the dictionary
signf_protein_dict['72hr']["mean(all)"] = {
    'Upregulated': quest.loc[quest['Status'] == 'Upregulated', 'Protein'].unique(),
    'Downregulated': quest.loc[quest['Status'] == 'Downregulated', 'Protein'].unique(),
    'Equivalent': quest.loc[quest['Status'] == 'Equivalent', 'Protein'].unique()
}
# Add the status counts to the protein_count_data DataFrame
n_status = len(status_counts)
protein_count_data = pd.concat(
    [
        protein_count_data,
        pd.DataFrame({
            'Method': ['mean(all)'] * n_status,
            'Duration': ['72hr'] * n_status,
            'Count Type': status_counts.index.tolist(),
            'Count': status_counts.values.tolist()
        })
    ],
    axis=0,
    ignore_index=True
)

# Save results
output_data = f"{output_path}AllProtein_72hr_QuEStVar.feather"
print(f"💾 SAVING RESULTS to {output_data}")
print("-" * 80)
quest.to_feather(f"{output_data}")
print()

# Generate main summary plot
print("📈 GENERATING QUESTVAR SUMMARY PLOT")
print("-" * 50)

plots.questvar_test_summary(
    quest_data=quest,
    p_thr=p_thr,
    eq_thr=eq_thr,
    df_thr=df_thr,
    cv_thr=cv_thr,
    correction=correction,
    cond_1=cond_1,
    cond_2=cond_2,
    figsize=(20, 15),
    title_add=f"Mean of All Peptides - Protein Data: {cond_1} vs {cond_2}",
    show=True,
    save=save_to_folder,
    filename=f"QuEStVar_Summary_AllProtein_72hr_Hypoxia_vs_Normoxia",
    fileformats=figure_formats,
    filepath=figure_path,
    transparent=transparent_bg,
    dpi=figure_dpi
)
 RUNNING QUESTVAR STATISTICAL ANALYSIS
================================================================================
Parameters: p_thr=0.05, cv_thr=1, df_thr=0.5, eq_thr=0.25
Correction: FDR, Conditions: True Hypoxia (1% O2) - 72 Hr vs Normoxia (21% O2) - 72 Hr, Equal Variance: False

📊 QUESTVAR ANALYSIS RESULTS
------------------------------------------------------------
Total proteins analyzed: 7,161
Results breakdown:
   • Unexplained: 5,463 (76.3%)
   • Equivalent: 1,347 (18.8%)
   • Upregulated: 193 (2.7%)
   • Downregulated: 151 (2.1%)
   • Excluded: 7 (0.1%)

💾 SAVING RESULTS to ./data/results/hypoxia/AllProtein_72hr_QuEStVar.feather
--------------------------------------------------------------------------------

📈 GENERATING QUESTVAR SUMMARY PLOT
--------------------------------------------------

03.4.4 dPF-Based Proteoform Intensities

cond_1 = "True Hypoxia (1% O2) - 72 Hr"
cond_2 = "Normoxia (21% O2) - 72 Hr"

print(" RUNNING QUESTVAR STATISTICAL ANALYSIS")
print("=" * 80)
print(f"Parameters: p_thr={p_thr}, cv_thr={cv_thr}, df_thr={df_thr}, eq_thr={eq_thr}")
print(f"Correction: {correction.upper()}, Conditions: {cond_1} vs {cond_2}, Equal Variance: {equal_var}")

res, info = tests.run_questvar(
    S1_arr=protein_dpf_72[condition_to_samples[cond_1]].values,
    S2_arr=protein_dpf_72[condition_to_samples[cond_2]].values,
    is_log2=False,
    cv_thr=cv_thr,
    p_thr=p_thr,
    df_thr=df_thr,
    eq_thr=eq_thr,
    var_equal=equal_var,
    is_paired=False,
    correction=correction,
    allow_missing=True,
)

info = info.set_index(protein_dpf_72.index).reset_index().set_index('Entry')

info['Status'] = info['Status'].replace({
    0.0: 'Unexplained',
    1.0: 'Equivalent',
    -1.0: 'Different',
}).fillna('Excluded')

quest = info.merge(
    res.iloc[:, :-1], 
    left_index=True, 
    right_index=True, 
    how='left'
)
quest['-log10(Fdr)'] = -np.log10(
    quest['df_adjp']
).replace([np.inf, -np.inf], np.nan)

quest['Status'] = np.where(
    quest['Status'] == 'Different',
    np.where(quest['log2FC'] > 0, 'Upregulated', 'Downregulated'),
    quest['Status']
)

if 'Proteoform' in quest.columns:
    quest['Protein'] = quest['Proteoform'].str.split('_').str[0]

id_col='Protein'
# Find the fasta info and map it, ensuring only one id_col is kept
cols = ['Gene', 'Description', 'Length', 'Weight(kDa)']#, 'Category', 'SPC']
qcols = [col for col in quest.columns if col != id_col]
quest = quest.merge(
    fasta_data[[id_col] + cols],
    left_on=id_col,
    right_on=id_col,
    how='left'
)
# Order the columns, keeping only one id_col
quest = quest[[id_col] + cols + qcols]

print()
print("📊 QUESTVAR ANALYSIS RESULTS")
print("-" * 60)

status_counts = quest['Status'].value_counts()
total_proteins = len(quest)

print(f"Total proteins analyzed: {total_proteins:,}")
print(f"Results breakdown:")
for status, count in status_counts.items():
    percentage = (count/total_proteins)*100
    print(f"   • {status}: {count:,} ({percentage:.1f}%)")
print()

# Add "Equivalent", "Upregulated", "Downregulated" to the dictionary
signf_protein_dict['72hr']["dPFs"] = {
    'Upregulated': quest.loc[quest['Status'] == 'Upregulated', 'Protein'].unique(),
    'Downregulated': quest.loc[quest['Status'] == 'Downregulated', 'Protein'].unique(),
    'Equivalent': quest.loc[quest['Status'] == 'Equivalent', 'Protein'].unique()
}
# Add the status counts to the protein_count_data DataFrame
n_status = len(status_counts)
protein_count_data = pd.concat(
    [
        protein_count_data,
        pd.DataFrame({
            'Method': ['dPFs'] * n_status,
            'Duration': ['72hr'] * n_status,
            'Count Type': status_counts.index.tolist(),
            'Count': status_counts.values.tolist()
        })
    ],
    axis=0,
    ignore_index=True
)

# Save results
output_data = f"{output_path}dPFProtein_72hr_QuEStVar.feather"
print(f"💾 SAVING RESULTS to {output_data}")
print("-" * 80)
quest.to_feather(f"{output_data}")
print()

# Generate main summary plot
print("📈 GENERATING QUESTVAR SUMMARY PLOT")
print("-" * 50)

plots.questvar_test_summary(
    quest_data=quest,
    p_thr=p_thr,
    eq_thr=eq_thr,
    df_thr=df_thr,
    cv_thr=cv_thr,
    correction=correction,
    cond_1=cond_1,
    cond_2=cond_2,
    figsize=(20, 15),
    title_add=f"dPFs - Protein Data: {cond_1} vs {cond_2}",
    show=True,
    save=save_to_folder,
    filename=f"QuEStVar_Summary_dPFProtein_72hr_Hypoxia_vs_Normoxia",
    fileformats=figure_formats,
    filepath=figure_path,
    transparent=transparent_bg,
    dpi=figure_dpi
)
 RUNNING QUESTVAR STATISTICAL ANALYSIS
================================================================================
Parameters: p_thr=0.05, cv_thr=1, df_thr=0.5, eq_thr=0.25
Correction: FDR, Conditions: True Hypoxia (1% O2) - 72 Hr vs Normoxia (21% O2) - 72 Hr, Equal Variance: False

📊 QUESTVAR ANALYSIS RESULTS
------------------------------------------------------------
Total proteins analyzed: 8,299
Results breakdown:
   • Unexplained: 5,469 (65.9%)
   • Equivalent: 1,363 (16.4%)
   • Downregulated: 743 (9.0%)
   • Upregulated: 699 (8.4%)
   • Excluded: 25 (0.3%)

💾 SAVING RESULTS to ./data/results/hypoxia/dPFProtein_72hr_QuEStVar.feather
--------------------------------------------------------------------------------

📈 GENERATING QUESTVAR SUMMARY PLOT
--------------------------------------------------

03.5 Significant Protein Status Comparison

Visualize and compare the distribution of protein statuses (Upregulated, Downregulated, Equivalent, Unexplained, Excluded) across quantification methods and timepoints.

# --- Data Preparation ---
# Convert the status counts in 'protein_count_data' to per-group percentages.
plot_data = protein_count_data[protein_count_data["Count Type"] != "All Proteins"].copy()
group_totals = plot_data.groupby(["Method", "Duration"])["Count"].sum().reset_index().rename(columns={"Count": "Total"})
plot_data = plot_data.merge(group_totals, on=["Method", "Duration"])
plot_data["Percent"] = plot_data["Count"] / plot_data["Total"] * 100


# --- Enforce Categorical Order ---
# This ensures plotting and sorting functions respect the desired order.
plot_data['Method'] = pd.Categorical(plot_data['Method'], categories=method_order, ordered=True)
plot_data['Count Type'] = pd.Categorical(plot_data['Count Type'], categories=count_type_order, ordered=True)
plot_data = plot_data.sort_values(['Method', 'Count Type'])


# --- Plotting ---
durations = sorted(plot_data["Duration"].unique())
fig, axes = plt.subplots(1, len(durations), figsize=(12, 6), sharey=True, squeeze=False)
axes = axes.flatten() # Ensure axes is always a flat array

for i, duration in enumerate(durations):
    ax = axes[i]
    data_dur = plot_data[plot_data["Duration"] == duration]

    # Pivot data for easier plotting
    df_pivot = data_dur.pivot(index='Method', columns='Count Type', values='Percent')

    # Create the stacked bar plot
    df_pivot.plot(
        kind='bar',
        stacked=True,
        ax=ax,
        color=[status_colors.get(col, '#000000') for col in df_pivot.columns],
        width=0.8,
        edgecolor="white",
        linewidth=0.5
    )

    # --- Add Annotations ---
    # Label each stacked segment with its percentage, hiding slices <= 4%.
    # Text colour is chosen by the perceived brightness of the segment colour.
    for container in ax.containers:
        r, g, b, _ = container.patches[0].get_facecolor()
        brightness = 0.299 * r + 0.587 * g + 0.114 * b
        text_color = 'black' if brightness > 0.6 else 'white'
        ax.bar_label(
            container,
            labels=[f'{p:.0f}%' if p > 4 else '' for p in container.datavalues],
            label_type='center',
            color=text_color,
            fontsize=12,
            fontweight='bold'
        )

    # --- Styling ---
    ax.set_title(f"Duration: {duration}", fontsize=14, fontweight='bold')
    ax.set_xlabel("") # X-label is shared, set below
    ax.tick_params(axis='x', labelrotation=0, labelsize=12)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.yaxis.grid(True, linestyle='--', linewidth=0.6, color='gray', alpha=0.4)
    ax.set_axisbelow(True)
    ax.get_legend().remove() # Remove individual legends

# --- Final Figure-Level Adjustments ---
# Set common labels
fig.text(0.5, 0.02, 'Method', ha='center', va='center', fontsize=14, fontweight='bold')
axes[0].set_ylabel("Percentage (%)", fontsize=14, fontweight='bold')
axes[0].tick_params(axis='y', labelsize=12)
axes[0].set_ylim(0, 100)

# Create a single, shared legend
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(
    handles,
    labels,
    title="Protein Status",
    loc='upper left',
    bbox_to_anchor=(.90, 0.9),
    fontsize=12,
    title_fontsize=13,
    frameon=False
)

fig.suptitle('Breakdown of Protein Status by Method and Duration', fontsize=18, fontweight='bold')

# Adjust layout to prevent overlap and make space for legend
plt.tight_layout(rect=[0, 0.04, 0.9, 0.95])
plots.finalize_plot(
    fig=fig,
    show=True,
    save=save_to_folder,
    filename="Protein_Status_Breakdown",
    filepath=figure_path,
    formats=figure_formats, 
    transparent=transparent_bg,
    dpi=figure_dpi,
)

Beyond the individual QuEStVar summary plots, the results can be condensed by comparing the counts of proteins in each status category across methods and timepoints. This shows how proteoform-aware quantification shifts the overall classification of protein regulation.

In this comparison, the dPFs method from ProteoForge stands out immediately: it makes a larger share of the proteins statistically explainable than any other method. It keeps the proportion of statistically equivalent proteins comparable to the other methods (and higher than Original and mean(top)) while increasing the proportion of up- and downregulated proteins detected, and it reduces the proportion of statistically unexplained proteins to the lowest among all methods.

This highlights the advantage of proteoform-aware quantification in revealing biologically relevant regulation that may be missed by traditional protein-level approaches.
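As a rough numeric companion to the plot, the share of statistically explainable proteins (Equivalent + Upregulated + Downregulated) can be computed directly from any counts table shaped like `protein_count_data`. The sketch below is a self-contained toy that uses only a subset of the 72 hr counts printed above (Downregulated and Excluded are omitted for brevity); in the notebook the same grouping would be applied to `protein_count_data` itself:

```python
import pandas as pd

# Toy counts table in the same shape as protein_count_data,
# using a subset of the 72 hr counts printed above for illustration
counts = pd.DataFrame({
    'Method':     ['Original'] * 3 + ['dPFs'] * 3,
    'Duration':   ['72hr'] * 6,
    'Count Type': ['Unexplained', 'Equivalent', 'Upregulated'] * 2,
    'Count':      [7739, 1094, 23, 5469, 1363, 699],
})

# Explainable = any status that the test could resolve
explained = ['Equivalent', 'Upregulated', 'Downregulated']
totals = counts.groupby('Method')['Count'].sum()
expl = counts[counts['Count Type'].isin(explained)].groupby('Method')['Count'].sum()
frac = (expl / totals).rename('Explainable Fraction')
print(frac)
```

Even on this reduced subset, the dPFs fraction comes out well above the Original protein-level fraction, matching the pattern in the stacked bars.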


04. Individual Protein Case Studies

Here we generate detailed boxplots for individual proteins to visualize intensity distributions across all quantification methods, conditions, and timepoints. This allows examination of specific cases where dPF-based analysis reveals different regulatory patterns than traditional methods.
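The reshaping used throughout this section follows the standard wide-to-long pattern: one row per (protein, sample) pair, a log10-transformed intensity, and a condition label parsed from the sample name. A minimal self-contained sketch with two hypothetical proteins and sample names mimicking the notebook's naming scheme:

```python
import numpy as np
import pandas as pd

# Hypothetical wide intensity table: rows = proteins, columns = samples
wide = pd.DataFrame(
    {'1%_48_1': [1000.0, 2000.0], 'C_48_1': [1500.0, 1800.0]},
    index=pd.Index(['P12345', 'Q67890'], name='Protein'),
)

# Wide -> long: one row per (protein, sample) pair
long_df = wide.reset_index().melt(
    id_vars='Protein', var_name='Sample', value_name='Intensity'
)
long_df['log10Intensity'] = np.log10(long_df['Intensity'])
# Condition is parsed from the sample name, as in melt_protein_df below
long_df['Condition'] = np.where(long_df['Sample'].str.contains('1%'),
                                'Hypoxia (1% O2)', 'Normoxia (21% O2)')
print(long_df)
```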

def melt_protein_df(df, method, id_col='Protein', extra_assign=None):
    melted = df.reset_index().melt(
        id_vars=id_col,
        var_name='Sample',
        value_name='Intensity'
    )
    melted['Method'] = method
    melted['log10Intensity'] = np.log10(melted['Intensity'])
    melted['Condition'] = np.where(melted['Sample'].str.contains('1%'), 'Hypoxia (1% O2)', 'Normoxia (21% O2)')
    melted['Duration'] = np.where(melted['Sample'].str.contains('48'), '48hr', '72hr')
    melted['Proteoform'] = melted[id_col]
    if extra_assign:
        for k, v in extra_assign.items():
            melted[k] = v(melted)
    return melted

# Main protein-level methods
plot_data = pd.concat([
    melt_protein_df(protein_top3, 'mean(top)'),
    melt_protein_df(protein_all, 'mean(all)'),
    melt_protein_df(original_protein, 'Original')
], ignore_index=True)

# protein from dPFs for 48hr and 72hr
for dpf_df, duration in [(protein_dpf_48, '48hr'), (protein_dpf_72, '72hr')]:
    melted = dpf_df.reset_index().melt(
        id_vars='Proteoform',
        var_name='Sample',
        value_name='Intensity'
    )
    # Split Proteoform into Protein and dPF number
    melted[['Protein', 'dPF_num']] = melted['Proteoform'].str.rsplit('_', n=1, expand=True)
    # Set Method as dPF_<dPF_num>
    melted['Method'] = 'dPF_' + melted['dPF_num']
    melted['log10Intensity'] = np.log10(melted['Intensity'])
    melted['Condition'] = np.where(melted['Sample'].str.contains('1%'), 'Hypoxia (1% O2)', 'Normoxia (21% O2)')
    melted['Duration'] = duration
    melted = melted.drop(columns=['dPF_num'])
    plot_data = pd.concat([plot_data, melted], ignore_index=True)

# quest_info_to place in the plot_data
cols_to_pick = ['Protein', 'Gene', 'Description', 'average', 'log2FC', 'comb_adjp', 'Status']
# quest = pd.read_feather(f"{output_path}AllProtein_48hr_QuEStVar.feather")
# OriginalProtein, Top3Protein, AllProtein, dPFProtein
# 48hr and 72hr
name_method_map = {
    'Original': 'OriginalProtein',
    'mean(top)': 'Top3Protein',
    'mean(all)': 'AllProtein',
}
quest_data = pd.DataFrame()
for method, file_suffix in name_method_map.items():
    for duration in ['48hr', '72hr']:
        file_path = f"{output_path}{file_suffix}_{duration}_QuEStVar.feather"
        if os.path.exists(file_path):
            temp_quest = pd.read_feather(file_path)[cols_to_pick]
            temp_quest['Method'] = method
            temp_quest['Duration'] = duration
            temp_quest['Proteoform'] = temp_quest['Protein']  # For consistency
            quest_data = pd.concat([quest_data, temp_quest], ignore_index=True)

# Add the dPFs data
for duration in ['48hr', '72hr']:
    file_path = f"{output_path}dPFProtein_{duration}_QuEStVar.feather"
    if os.path.exists(file_path):
        temp_quest = pd.read_feather(file_path)[cols_to_pick + ['Proteoform']]
        # temp_quest['Method'] = 'dPFs'
        temp_quest['Method'] = temp_quest['Proteoform'].str.split('_').str[1].apply(lambda x: 'dPF_' + x)
        temp_quest['Duration'] = duration
        quest_data = pd.concat([quest_data, temp_quest], ignore_index=True)

# Remove Protein column to avoid confusion
quest_data = quest_data.drop(columns=['Protein'])
# Merge quest_data into plot_data
plot_data = plot_data.merge(
    quest_data,
    on=['Proteoform', 'Method', 'Duration'],
    how='left',
)[[
    'Protein', 'Proteoform', 'Gene', 'Description', 
    'Method', 'Duration', 'Sample', 
    'Intensity', 'log10Intensity', 
    'Condition', 'Status', 'average', 'log2FC', 'comb_adjp'
]]

plot_data
Protein Proteoform Gene Description Method Duration Sample Intensity log10Intensity Condition Status average log2FC comb_adjp
0 A0A024RBG1 A0A024RBG1 NUDT4B Diphosphoinositol polyphosphate phosphohydrola... mean(top) 48hr 1%_48_1 42434.0877 4.6277 Hypoxia (1% O2) Equivalent 15.4190 -0.0542 0.0403
1 A0A0B4J2D5 A0A0B4J2D5 GATD3B Putative glutamine amidotransferase-like class... mean(top) 48hr 1%_48_1 102861.7619 5.0123 Hypoxia (1% O2) Unexplained 16.6117 0.0899 0.1765
2 A0A1B0GTU1 A0A1B0GTU1 ZC3H11B Zinc finger CCCH domain-containing protein 11B mean(top) 48hr 1%_48_1 15418.3217 4.1880 Hypoxia (1% O2) Unexplained 13.9667 0.0164 0.1247
3 A0A1W2PQL4 A0A1W2PQL4 ZNF722 Zinc finger protein 722 mean(top) 48hr 1%_48_1 138883.8369 5.1427 Hypoxia (1% O2) Unexplained 17.1534 -0.3106 0.4431
4 A0A2R8Y4L2 A0A2R8Y4L2 HNRNPA1L3 Heterogeneous nuclear ribonucleoprotein A1-like 3 mean(top) 48hr 1%_48_1 2516111.9537 6.4007 Hypoxia (1% O2) Equivalent 21.0464 -0.0062 0.0270
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
438183 Q9Y6X8 Q9Y6X8_1 ZHX2 Zinc fingers and homeoboxes protein 2 dPF_1 72hr C_72_3 17.1938 1.2354 Normoxia (21% O2) Upregulated 8.0065 7.4581 0.0001
438184 Q9Y6X9 Q9Y6X9_0 MORC2 ATPase MORC2 dPF_0 72hr C_72_3 6221.2750 3.7939 Normoxia (21% O2) Unexplained 12.4735 -0.2845 0.1865
438185 Q9Y6Y0 Q9Y6Y0_0 IVNS1ABP Influenza virus NS1A-binding protein dPF_0 72hr C_72_3 8806.4371 3.9448 Normoxia (21% O2) Unexplained 12.9802 -0.2372 0.5863
438186 Q9Y6Y8 Q9Y6Y8_0 SEC23IP SEC23-interacting protein dPF_0 72hr C_72_3 7182.0966 3.8563 Normoxia (21% O2) Equivalent 12.8089 -0.0829 0.0447
438187 Q9Y6Y8 Q9Y6Y8_1 SEC23IP SEC23-interacting protein dPF_1 72hr C_72_3 3242.0561 3.5108 Normoxia (21% O2) Downregulated 8.3132 -6.9870 0.0108

438188 rows × 14 columns

all_methods_order = plot_data['Method'].unique().tolist()
all_methods_order = ['Original', 'mean(top)', 'mean(all)'] + sorted([m for m in all_methods_order if m.startswith('dPF_')], key=lambda x: int(x.split('_')[1]))

summary = plot_data.pivot_table(
    columns=['Method', 'Duration'],
    index='Protein',
    values='Status',
    aggfunc='first'
).fillna('No Data')
# Order the Method by method_order
summary = summary.reindex(columns=pd.MultiIndex.from_product([all_methods_order, ['48hr', '72hr']]), fill_value='No Data')
summary['Total Methods'] = (summary != 'No Data').sum(axis=1)
summary = summary.sort_values(['Total Methods'] + [(method, dur) for method in all_methods_order for dur in ['48hr', '72hr']], ascending=False)
summary = summary.reset_index()
# Rules
# Remove proteins with Excluded
summary = summary[~summary.isin(['Excluded']).any(axis=1)]
# Fast vectorized version for consistency check
# 1. Get base statuses for Original, mean(top), mean(all) for both durations
base_methods = ['Original', 'mean(top)', 'mean(all)']
dpf_methods = [m for m in all_methods_order if m.startswith('dPF_')]
durations = ['48hr', '72hr']

# Get the columns for base and dPFs
base_cols = [(m, d) for m in base_methods for d in durations]
dpf_cols = [(m, d) for m in dpf_methods for d in durations]

# Convert to numpy for fast access
summary_arr = summary[[col for col in summary.columns if col not in ['Protein', 'Total Methods']]].values

# Map column index for each method/duration
col_idx = {col: i for i, col in enumerate(summary.columns) if col not in ['Protein', 'Total Methods']}

# Precompute indices
base_indices = [col_idx[c] for c in base_cols]
dpf_indices = [col_idx[c] for c in dpf_cols]

# Get base status for each row (if consistent and not 'No Data')
def get_base_status(row):
    vals = row[base_indices]
    vals = vals[vals != 'No Data']
    if len(vals) == 0:
        return None
    if np.all(vals == vals[0]):
        return vals[0]
    return None

base_statuses = np.apply_along_axis(get_base_status, 1, summary_arr)

# Get dPF statuses for each row
dpf_statuses = summary_arr[:, dpf_indices]

# For each row, check if any dPF status is different from base_status (and not 'No Data')
mask = []
for i, base_status in enumerate(base_statuses):
    if base_status is None or base_status == 'No Data':
        mask.append(False)
        continue
    dpf_vals = dpf_statuses[i]
    dpf_vals = dpf_vals[dpf_vals != 'No Data']
    if len(dpf_vals) == 0:
        mask.append(False)
    else:
        mask.append(np.any(dpf_vals != base_status))
summary = summary[np.array(mask)]
# If Original has No Data, don't include
summary = summary[
    (summary[('Original', '48hr')] != 'No Data') &
    (summary[('Original', '72hr')] != 'No Data')
]
summary.head(25)
Protein Original mean(top) mean(all) dPF_0 dPF_1 dPF_2 dPF_3 Total Methods
48hr 72hr 48hr 72hr 48hr 72hr 48hr 72hr 48hr 72hr 48hr 72hr 48hr 72hr
1 Q15149 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Upregulated Upregulated Downregulated Downregulated No Data No Data 12
2 Q8WXH0 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Equivalent Downregulated Upregulated Upregulated Downregulated No Data No Data 12
3 Q9UPN3 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Equivalent Downregulated Upregulated Upregulated Downregulated No Data No Data 12
4 P46821 Equivalent Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Downregulated Upregulated Upregulated Downregulated No Data No Data 12
5 O15020 Equivalent Unexplained Unexplained Equivalent Unexplained Unexplained Unexplained Unexplained Downregulated Upregulated Upregulated Downregulated No Data No Data 12
6 Q9NU22 Equivalent Unexplained Equivalent Unexplained Equivalent Unexplained Equivalent Unexplained Upregulated Upregulated Downregulated Downregulated No Data No Data 12
7 Q9Y4A5 Equivalent Equivalent Unexplained Unexplained Equivalent Equivalent Equivalent Equivalent Upregulated Downregulated Downregulated Upregulated No Data No Data 12
8 P78527 Equivalent Equivalent Equivalent Unexplained Equivalent Equivalent Equivalent Equivalent Upregulated Upregulated Downregulated Unexplained No Data No Data 12
9 P50851 Equivalent Equivalent Equivalent Equivalent Equivalent Equivalent Equivalent Equivalent Upregulated Upregulated Downregulated Downregulated No Data No Data 12
10 Q8IVF2 Upregulated Upregulated Upregulated Unexplained Upregulated Upregulated No Data Upregulated Upregulated Upregulated Upregulated Downregulated No Data No Data 11
11 O14578 Unexplained Unexplained Upregulated Upregulated Unexplained Equivalent Unexplained Unexplained Upregulated Upregulated Downregulated No Data No Data No Data 11
12 O94915 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Upregulated Upregulated No Data Downregulated No Data No Data 11
13 P55011 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Downregulated Upregulated Upregulated No Data No Data No Data 11
14 Q14789 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Downregulated Downregulated No Data Upregulated No Data No Data 11
16 P15924 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained No Data Upregulated Upregulated Downregulated Unexplained No Data No Data 11
18 O94854 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Equivalent Upregulated Upregulated Downregulated No Data No Data No Data 11
19 Q96Q89 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Equivalent Unexplained Downregulated Downregulated No Data Upregulated No Data No Data 11
20 Q96SN8 Unexplained Unexplained Unexplained Unexplained Unexplained Unexplained Equivalent Unexplained Downregulated Downregulated No Data Upregulated No Data No Data 11
21 Q9H7D0 Unexplained Unexplained Unexplained Unexplained Unexplained Equivalent Unexplained Equivalent Upregulated Upregulated No Data Downregulated No Data No Data 11
22 Q8N3C0 Unexplained Unexplained Unexplained Unexplained Unexplained Equivalent Unexplained Equivalent Downregulated Upregulated No Data Downregulated No Data No Data 11
23 Q13459 Unexplained Unexplained Unexplained Unexplained Unexplained Equivalent Unexplained Equivalent Downregulated Downregulated Upregulated No Data No Data No Data 11
24 Q01082 Unexplained Unexplained Unexplained Unexplained Equivalent Unexplained No Data Unexplained Upregulated Upregulated Equivalent Downregulated No Data No Data 11
25 O95714 Unexplained Unexplained Unexplained Unexplained Equivalent Unexplained Equivalent Unexplained Downregulated Downregulated No Data Upregulated No Data No Data 11
26 Q03001 Unexplained Unexplained Unexplained Unexplained Equivalent Equivalent No Data Equivalent Downregulated Upregulated Equivalent Downregulated No Data No Data 11
27 Q03164 Unexplained Unexplained Unexplained Unexplained Equivalent Equivalent Equivalent Equivalent Upregulated Downregulated No Data Upregulated No Data No Data 11
is_demo = True # Set to False to save plot
current_protein = 'P50851'

subset_plot = plot_data[plot_data['Protein'] == current_protein].copy()  # copy to avoid SettingWithCopyWarning below
# Define the order of methods for consistent plotting
dPF_methods = sorted(subset_plot['Method'].unique())
dPF_methods = [m for m in dPF_methods if m.startswith('dPF_')]
main_methods = ['Original', 'mean(top)', 'mean(all)']
method_order = main_methods + dPF_methods
method_order
# Set categories with the defined order
subset_plot['Method'] = pd.Categorical(subset_plot['Method'], categories=method_order, ordered=True)
subset_plot['Duration'] = pd.Categorical(subset_plot['Duration'], categories=['48hr', '72hr'], ordered=True)
subset_plot['Condition'] = pd.Categorical(subset_plot['Condition'], categories=['Normoxia (21% O2)', 'Hypoxia (1% O2)'], ordered=True)
# Sort the DataFrame based on the categorical order
subset_plot = subset_plot.sort_values(['Method', 'Duration', 'Condition', 'Proteoform']).reset_index(drop=True)
# subset_plot.to_csv("debug_subset_plot.csv", index=False)
subset_plot
Protein Proteoform Gene Description Method Duration Sample Intensity log10Intensity Condition Status average log2FC comb_adjp
0 P50851 P50851 LRBA Lipopolysaccharide-responsive and beige-like a... Original 48hr C_48_1 153304.3858 5.1856 Normoxia (21% O2) Equivalent 17.2353 0.0167 0.0198
1 P50851 P50851 LRBA Lipopolysaccharide-responsive and beige-like a... Original 48hr C_48_2 166143.9163 5.2205 Normoxia (21% O2) Equivalent 17.2353 0.0167 0.0198
2 P50851 P50851 LRBA Lipopolysaccharide-responsive and beige-like a... Original 48hr C_48_3 156653.4033 5.1949 Normoxia (21% O2) Equivalent 17.2353 0.0167 0.0198
3 P50851 P50851 LRBA Lipopolysaccharide-responsive and beige-like a... Original 48hr C_48_4 138785.0132 5.1423 Normoxia (21% O2) Equivalent 17.2353 0.0167 0.0198
4 P50851 P50851 LRBA Lipopolysaccharide-responsive and beige-like a... Original 48hr 1%_48_1 156376.3783 5.1942 Hypoxia (1% O2) Equivalent 17.2353 0.0167 0.0198
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
79 P50851 P50851_2 LRBA Lipopolysaccharide-responsive and beige-like a... dPF_2 72hr C_72_2 1848.7562 3.2669 Normoxia (21% O2) Downregulated 7.9915 -6.4031 0.0004
80 P50851 P50851_2 LRBA Lipopolysaccharide-responsive and beige-like a... dPF_2 72hr C_72_3 2741.9365 3.4381 Normoxia (21% O2) Downregulated 7.9915 -6.4031 0.0004
81 P50851 P50851_2 LRBA Lipopolysaccharide-responsive and beige-like a... dPF_2 72hr 1%_72_2 29.2296 1.4658 Hypoxia (1% O2) Downregulated 7.9915 -6.4031 0.0004
82 P50851 P50851_2 LRBA Lipopolysaccharide-responsive and beige-like a... dPF_2 72hr 1%_72_3 21.4135 1.3307 Hypoxia (1% O2) Downregulated 7.9915 -6.4031 0.0004
83 P50851 P50851_2 LRBA Lipopolysaccharide-responsive and beige-like a... dPF_2 72hr 1%_72_4 33.8256 1.5292 Hypoxia (1% O2) Downregulated 7.9915 -6.4031 0.0004

84 rows × 14 columns

from matplotlib.lines import Line2D
from matplotlib.patches import Patch, Rectangle

# Get current gene and description
try:
    current_gene = subset_plot['Gene'].dropna().unique()[0]
    current_description = subset_plot['Description'].dropna().unique()[0]
except (IndexError, KeyError):
    current_gene = "Unknown Gene"
    current_description = "No description available"

# --- Plot Setup ---
yOffset = 0.35
ymin = subset_plot['log10Intensity'].min() - yOffset
ymax = subset_plot['log10Intensity'].max() + yOffset

# Calculate marker y position and extended ylim
status_marker_y = ymax + 0.13 * (ymax - ymin)
ylim_top = status_marker_y + 0.08 * (ymax - ymin)

fig, ax = plt.subplots(figsize=(16, 6))
fig.patch.set_facecolor('white')

xtick_positions = np.arange(len(method_order))
xtick_labels = method_order

total_width = 0.8
box_width = total_width / 6
spacing = total_width / 12

boxplot_offsets = {
    ('48hr', 'Normoxia (21% O2)'): -total_width/2 + box_width/2,
    ('48hr', 'Hypoxia (1% O2)'): -total_width/2 + box_width/2 + box_width + spacing,
    ('72hr', 'Normoxia (21% O2)'): -total_width/2 + box_width/2 + 2*(box_width + spacing),
    ('72hr', 'Hypoxia (1% O2)'): -total_width/2 + box_width/2 + 3*(box_width + spacing)
}
duration_center_offsets = {'48hr': -0.2, '72hr': 0.2}
jitter_amount = 0.055

# --- Background rectangles for Duration (full ylim) ---
for i, method in enumerate(method_order):
    for duration in ['48hr', '72hr']:
        offset_norm = boxplot_offsets[(duration, 'Normoxia (21% O2)')]
        offset_hyp = boxplot_offsets[(duration, 'Hypoxia (1% O2)')]
        rect_x_start = xtick_positions[i] + min(offset_norm, offset_hyp) - box_width/2
        rect_x_end = xtick_positions[i] + max(offset_norm, offset_hyp) + box_width/2
        rect_width = rect_x_end - rect_x_start
        rect_height = ylim_top - ymin
        rect = Rectangle(
            (rect_x_start, ymin), rect_width, rect_height,
            facecolor=dura_colors[duration], 
            alpha=0.15, 
            edgecolor='none',
            zorder=1
        )
        ax.add_patch(rect)

# --- Skeleton boxplots ---
for i, method in enumerate(method_order):
    for (duration, condition), offset in boxplot_offsets.items():
        data_subset = subset_plot[
            (subset_plot['Method'] == method) & 
            (subset_plot['Duration'] == duration) & 
            (subset_plot['Condition'] == condition)
        ]
        if data_subset.empty:
            continue
        x_center = xtick_positions[i] + offset
        intensities = data_subset['log10Intensity'].dropna()
        if len(intensities) > 0:
            q1 = intensities.quantile(0.25)
            median = intensities.median()
            q3 = intensities.quantile(0.75)
            iqr = q3 - q1
            lower_whisker = max(intensities.min(), q1 - 1.5 * iqr)
            upper_whisker = min(intensities.max(), q3 + 1.5 * iqr)
            box_x = x_center - box_width/2
            box_rect = Rectangle(
                (box_x, q1), box_width, q3-q1,
                facecolor='none',  # transparent fill
                edgecolor='#333333',
                linewidth=1.5,
                alpha=0.9,
                zorder=4
            )
            ax.add_patch(box_rect)
            ax.plot([box_x, box_x + box_width], [median, median], 
                    color='#d62728', linewidth=2.5, zorder=5, alpha=0.9)
            if upper_whisker > q3:
                ax.plot([x_center, x_center], [q3, upper_whisker], 
                        color='#333333', linewidth=1.5, zorder=4, alpha=0.8)
                cap_width = box_width * 0.5
                ax.plot([x_center - cap_width/2, x_center + cap_width/2], 
                        [upper_whisker, upper_whisker], color='#333333', linewidth=1.5, zorder=4, alpha=0.8)
            if lower_whisker < q1:
                ax.plot([x_center, x_center], [q1, lower_whisker], 
                        color='#333333', linewidth=1.5, zorder=4, alpha=0.8)
                cap_width = box_width * 0.5
                ax.plot([x_center - cap_width/2, x_center + cap_width/2], 
                        [lower_whisker, lower_whisker], color='#333333', linewidth=1.5, zorder=4, alpha=0.8)

# --- Jittered points ---
for i, method in enumerate(method_order):
    for (duration, condition), offset in boxplot_offsets.items():
        data_subset = subset_plot[
            (subset_plot['Method'] == method) & 
            (subset_plot['Duration'] == duration) & 
            (subset_plot['Condition'] == condition)
        ]
        if data_subset.empty:
            continue
        x_base = xtick_positions[i] + offset
        color = cond_colors[condition]
        np.random.seed(42)
        x_jittered = x_base + np.random.uniform(-jitter_amount, jitter_amount, size=len(data_subset))
        label = condition if (i == 0 and duration == '48hr') else ""
        ax.scatter(
            x_jittered,
            data_subset['log10Intensity'],
            color=color,
            label=label,
            alpha=0.85,
            edgecolor='#333333',
            linewidth=0.6,
            s=75,
            zorder=6
        )

# --- Connecting lines for same Proteoform ---
for i, method in enumerate(method_order):
    for duration in ['48hr', '72hr']:
        data_subset = subset_plot[(subset_plot['Method'] == method) & (subset_plot['Duration'] == duration)]
        if data_subset.empty:
            continue
        for proteoform in data_subset['Proteoform'].unique():
            proteoform_data = data_subset[data_subset['Proteoform'] == proteoform]
            if len(proteoform_data['Condition'].unique()) == 2:
                sorted_data = proteoform_data.sort_values('Condition')
                y_vals = sorted_data['log10Intensity'].values
                conds = sorted_data['Condition'].values
                x_vals = []
                for cond in conds:
                    offset = boxplot_offsets[(duration, cond)]
                    x_base = xtick_positions[i] + offset
                    x_vals.append(x_base)
                if len(x_vals) == 2 and len(y_vals) == 2:
                    ax.plot(
                        x_vals,
                        y_vals,
                        color='#666666',
                        alpha=0.4,
                        linewidth=1.2,
                        linestyle='-',
                        zorder=3
                    )

# --- Statistical test lines and status markers ---
for i, method in enumerate(method_order):
    for duration in ['48hr', '72hr']:
        offset_norm = boxplot_offsets[(duration, 'Normoxia (21% O2)')]
        offset_hyp = boxplot_offsets[(duration, 'Hypoxia (1% O2)')]
        x_norm = xtick_positions[i] + offset_norm
        x_hyp = xtick_positions[i] + offset_hyp
        ax.plot([x_norm, x_hyp], [status_marker_y, status_marker_y], 
                color='#333333', lw=1.5, zorder=7, alpha=0.8)
        method_duration_data = subset_plot[
            (subset_plot['Method'] == method) & 
            (subset_plot['Duration'] == duration)
        ]
        conds_present = set(method_duration_data['Condition'].unique())
        required_conds = {'Normoxia (21% O2)', 'Hypoxia (1% O2)'}
        if (required_conds.issubset(conds_present) and 
            not method_duration_data.empty and 
            not method_duration_data['Status'].isnull().all()):
            try:
                status = method_duration_data['Status'].mode()[0]
            except IndexError:
                status = ''
        else:
            status = ''
        x_center = (x_norm + x_hyp) / 2
        if status != '':
            marker = status_markers.get(status, 'o')
            color = status_colors.get(status, '#C2C0C0')
            ax.scatter(
                x_center, status_marker_y,
                marker=marker,
                color=color,
                edgecolor='#333333',
                s=220,
                zorder=8,
                linewidth=1.3,
                alpha=0.9
            )

# --- Two-level x-axis labels ---
duration_positions = []
duration_labels = []
for i, method in enumerate(method_order):
    offset_48hr = xtick_positions[i] + duration_center_offsets['48hr']
    offset_72hr = xtick_positions[i] + duration_center_offsets['72hr']
    duration_positions.extend([offset_48hr, offset_72hr])
    duration_labels.extend(['48hr', '72hr'])

ax.set_xticks(duration_positions)
ax.set_xticklabels(duration_labels, fontsize=14, fontweight='bold', color='#444444')
ax.tick_params(axis='x', top=False, bottom=True, length=0, pad=8)

ax2 = ax.twiny()
ax2.set_xlim(ax.get_xlim())
ax2.set_xticks(xtick_positions)
ax2.set_xticklabels(xtick_labels)
for label in ax2.get_xticklabels():
    label.set_fontsize(20)
    label.set_fontweight('bold') 
    label.set_color('#000000')
ax2.xaxis.set_ticks_position('bottom')
ax2.xaxis.set_label_position('bottom')
ax2.spines['bottom'].set_position(('outward', 30))
ax2.tick_params(
    axis='x', top=False, bottom=True, length=0, pad=10,
    labelsize=15, labelcolor='#444444',
    labelrotation=0
)
for label in ax2.get_xticklabels():
    label.set_fontweight('bold')

# --- Axis styling ---
ax.set_ylabel('log10 Intensity', fontsize=16, fontweight='bold', color='#333333', labelpad=10)
ax.set_title(f'{current_gene} ({current_protein}) Protein Intensity Comparison',
             fontsize=19, fontweight='bold', color='#222222', pad=35)
ax.set_ylim(ymin, ylim_top)
ax.yaxis.grid(True, linestyle='--', linewidth=0.8, color='#cccccc', alpha=0.7)
ax.set_axisbelow(True)
ax.tick_params(axis='y', colors='#444444', labelsize=14, length=4, width=1.2)

# --- Legend System ---
condition_handles = [
    Line2D([0], [0], marker='o', color='w', label=cond, 
           markerfacecolor=color, markersize=14, markeredgecolor='#333333', 
           markeredgewidth=0.8, linestyle='None')
    for cond, color in cond_colors.items()
]
duration_handles = [
    Patch(facecolor=dura_colors[d], edgecolor='#666666', 
          label=f'{d}', alpha=0.25, linewidth=0.8)
    for d in ['48hr', '72hr']
]
status_handles = [
    Line2D([0], [0], marker=status_markers[s], color='w', label=f'{s}', 
           markerfacecolor=status_colors[s], markeredgecolor='#333333', 
           markersize=16, linestyle='None', markeredgewidth=1.0)
    for s in status_colors.keys() if s in status_markers
]
all_legend_elements = []
all_legend_labels = []
if condition_handles:
    all_legend_elements.extend(condition_handles)
    all_legend_labels.extend([h.get_label() for h in condition_handles])
if duration_handles:
    all_legend_elements.extend(duration_handles)
    all_legend_labels.extend([h.get_label() for h in duration_handles])
if status_handles:
    all_legend_elements.extend(status_handles)
    all_legend_labels.extend([h.get_label() for h in status_handles])
if all_legend_elements:
    ax.legend(
        all_legend_elements,
        all_legend_labels,
        loc='center left', bbox_to_anchor=(1.05, 0.5),
        fontsize=11, title='Legend', title_fontsize=13, 
        borderpad=1.2, labelspacing=1.1, frameon=False
    )
plt.subplots_adjust(right=0.75)
plots.finalize_plot(
    fig, 
    show=True,
    save=not is_demo,
    filename=f"ProteinIntensity_{current_protein}_{current_gene}",
    filepath=figure_path,
    formats=figure_formats, 
    transparent=transparent_bg,
    dpi=figure_dpi,
)

The above is one example protein that can be highlighted here. The figure itself is fairly simple, but what it shows is that the "Equivalent" status assigned by the other methods at each timepoint turns out to be up- and down-regulation in dPF_1 and dPF_2, respectively. In other words, the LRBA protein has two additional proteoforms that are regulated differently under hypoxia vs. normoxia at different timepoints, a signal that is missed by traditional protein-level quantification methods.

Note: These figures can be generated here for any other proteins you want to highlight as well. However, since I won't be including them in the manuscript (even though I want to because they look very cool), I am not generating more of these plots here.

Note: These plots are heavily customized and there is no function to generate them automatically, since they serve more as an explainer and will not be included in the main manuscript.


05. Functional Enrichment Analysis

Perform Gene Ontology (GO) enrichment analysis using gProfiler to compare the biological interpretations derived from different protein quantification methods. This reveals how method choice affects downstream pathway-level conclusions.

05.1 Setup Enrichment Analysis

Configure enrichment parameters and define protein sets for each method-duration-status combination.

The sets of proteins belonging to each statistical status (Upregulated, Downregulated, Equivalent) from each quantification method and timepoint are prepared for GO enrichment analysis. This allows comparison of the biological themes captured by each approach.

The same parameters will be used for all enrichment analyses to ensure comparability.
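The flattening performed in the next cell (duration → method → status → protein list, joined into `{duration}_{method}_{status}` set names) can be sketched on a toy nested dict. The structure mirrors `signf_protein_dict`; the protein IDs here are hypothetical:

```python
# Minimal sketch, assuming a nested dict shaped like signf_protein_dict:
# duration -> method -> status -> list of protein accessions (toy IDs).
signf = {
    '48hr': {
        'Original': {'Upregulated': ['P1', 'P2'], 'Equivalent': ['P3']},
        'dPFs': {'Upregulated': ['P1', 'P4']},
    },
}

# Flatten into named query sets, e.g. '48hr_Original_Upregulated'.
enrichment_sets = {
    f"{duration}_{method}_{status}": proteins
    for duration, methods in signf.items()
    for method, statuses in methods.items()
    for status, proteins in statuses.items()
}

print(sorted(enrichment_sets))
```

The real cell below does the same, with extra filtering and aligned printing.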

# Variables for the enrichment setup
m_pval = 0.1 
e_pval = 0.01  
correction = 'fdr' # fdr, bonferroni, g_SCS
pval_cap = 10**-15 
organism = "hsapiens"
# sources = ['GO:BP', 'GO:MF', 'GO:CC', 'KEGG', 'REAC', 'WP']
sources = ['GO:BP', 'GO:MF', 'GO:CC'] # Limit to Gene Ontology for faster results

# All Available Proteins as background...
background = list(set(original_protein.index).union(set(proteins_used)))

print(f"🔍 RUNNING ENRICHMENT ANALYSIS ON 12 PROTEIN SETS")
print(f"    - Background Proteins: {len(background):,}")
print(f"    - Methods: {', '.join(method_order)}")
print(f"    - Durations: {', '.join(durations)}")
print(f"    - Sources: {', '.join(sources)}")
print(f"    - p-value threshold: {m_pval} (term inclusion), {e_pval} (term reporting)")
print(f"    - Multiple Testing Correction: {correction.upper()}")
print(f"   {'Set Name':<30}   | {'Dur':<5} | {'Method':<10} | {'Status':<15} | {'# Proteins':>10}")
enrichment_sets = {}
# Flatten the dictionary for easier iteration
# Example Naming: '48hr_mean(top)_Upregulated'
for duration in ['48hr', '72hr']:
    for proMethod, protList in signf_protein_dict[duration].items():
        for status, proteins in protList.items():
            # if status == 'Equivalent':
            #     continue
            if proMethod in ['mean(all)', 'mean(top)']:
                continue
            set_name = f"{duration}_{proMethod}_{status}"
            enrichment_sets[set_name] = proteins.tolist()
            # Print with aligned columns for better readability
            print(f"   • {set_name:<30} | {duration:<5} | {proMethod:<10} | {status:<15} |  {len(proteins):>6} proteins")

queries = enrichment_sets.keys()
analysis_name = 'gProfiler_Enrichment_Analysis_AllSets'
# enrich.printParams(
#     m_pval,
#     e_pval,
#     correction,
#     pval_cap,
#     organism,
#     sources,
#     background,
#     enrichment_sets,
#     analysis_name
# )
🔍 RUNNING ENRICHMENT ANALYSIS ON 12 PROTEIN SETS
    - Background Proteins: 9,294
    - Methods: Original, mean(top), mean(all), dPF_0, dPF_1, dPF_2
    - Durations: 48hr, 72hr
    - Sources: GO:BP, GO:MF, GO:CC
    - p-value threshold: 0.1 (term inclusion), 0.01 (term reporting)
    - Multiple Testing Correction: FDR
   Set Name                         | Dur   | Method     | Status          | # Proteins
   • 48hr_Original_Upregulated      | 48hr  | Original   | Upregulated     |      65 proteins
   • 48hr_Original_Downregulated    | 48hr  | Original   | Downregulated   |      19 proteins
   • 48hr_Original_Equivalent       | 48hr  | Original   | Equivalent      |    1892 proteins
   • 48hr_dPFs_Upregulated          | 48hr  | dPFs       | Upregulated     |     344 proteins
   • 48hr_dPFs_Downregulated        | 48hr  | dPFs       | Downregulated   |     318 proteins
   • 48hr_dPFs_Equivalent           | 48hr  | dPFs       | Equivalent      |    2339 proteins
   • 72hr_Original_Upregulated      | 72hr  | Original   | Upregulated     |      23 proteins
   • 72hr_Original_Downregulated    | 72hr  | Original   | Downregulated   |      16 proteins
   • 72hr_Original_Equivalent       | 72hr  | Original   | Equivalent      |    1094 proteins
   • 72hr_dPFs_Upregulated          | 72hr  | dPFs       | Upregulated     |     685 proteins
   • 72hr_dPFs_Downregulated        | 72hr  | dPFs       | Downregulated   |     728 proteins
   • 72hr_dPFs_Equivalent           | 72hr  | dPFs       | Equivalent      |    1363 proteins

05.2 Run Enrichment for All Combinations

Execute GO enrichment analysis for each combination of quantification method, timepoint, and protein status using gProfiler. Since gProfiler accepts the multi-query format, all analyses are run in a single batch for efficiency and consistency.

Once it has been run (with unchanged parameters), the results can be loaded directly from disk instead of re-running the enrichment each time. The first function call below is commented out for exactly that reason.
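This commented-out-call pattern can be made less error-prone with a small run-or-load helper; `load_or_compute` is a hypothetical name, not part of the `enrich` module:

```python
import os

import pandas as pd


def load_or_compute(path, compute):
    """Return cached results from `path` if present; otherwise compute and save.

    `compute` is any zero-argument callable returning a DataFrame.
    """
    if os.path.exists(path):
        return pd.read_csv(path)
    df = compute()
    df.to_csv(path, index=False)
    return df
```

With this, the enrichment call can stay in the notebook uncommented and only actually runs when the cached CSV is missing.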

# # Run and Save the Enrichment Analysis
# enrich.run_gprofiler(
#     query = enrichment_sets,
#     background = background,
#     organism = organism,
#     user_threshold = m_pval,
#     signf_threshold = e_pval,
#     correction = correction,
#     sources = sources,
#     pval_cap = pval_cap,
#     no_evidences = True,
#     no_iea = False,
#     ordered = False,
#     simplify_cols=True,
#     save_path = output_path,
#     analysis_name = analysis_name,
#     verbose = True
# )

# print()

# Read the results
gp_res = pd.read_csv(
    os.path.join(
        output_path, 
        f"{analysis_name}.csv"
    ),
    sep=","
)

# Create a subset of the results with the p-value cutoff
subset_gp_res = gp_res[
    gp_res["p_value"] < e_pval
].copy()  # copy so the column assignment below does not hit a view
subset_gp_res[['Duration', 'Method', 'Status']] = subset_gp_res['query'].str.split('_', expand=True)

print(f"""After the p-value cutoff of {e_pval} there are total of {subset_gp_res.shape[0]} enriched terms.""")

utils.print_series(
    subset_gp_res["query"].value_counts(),
    header = f"Number of enriched terms per query with p-val cutoff {e_pval}:",
    tab = 2,
    elements_with_order = list(queries)
)
After the p-value cutoff of 0.01 there are total of 2167 enriched terms.
Number of enriched terms per query with p-val cutoff 0.01:
   48hr_Original_Upregulated -> 37
   48hr_Original_Downregulated -> 6
   48hr_Original_Equivalent -> 458
   48hr_dPFs_Upregulated -> 118
   48hr_dPFs_Downregulated -> 97
   48hr_dPFs_Equivalent -> 449
   72hr_Original_Upregulated -> 5
   72hr_Original_Equivalent -> 294
   72hr_dPFs_Upregulated -> 176
   72hr_dPFs_Downregulated -> 233
   72hr_dPFs_Equivalent -> 294

At the enrichment cutoff, the number of enriched GO terms varies across methods and statuses. While I could inspect each set individually, I wanted a summary plot showing the number of enriched GO terms across all method-duration-status combinations for easier comparison.


05.3 Summary of Enriched GO Terms Across Methods

05.3.1 Heatmap of Enriched GO Terms Across Methods

This is a simple heatmap showing the number of enriched GO terms across all combinations of quantification methods, timepoints, and protein statuses. This visualization allows easy comparison of the biological insights captured by each approach.
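The counting behind the heatmap is a single `pivot_table` with `aggfunc='count'`. A toy version (column names mirror `subset_gp_res`; the rows are made up):

```python
import pandas as pd

# Toy enrichment results; columns mirror subset_gp_res.
df = pd.DataFrame({
    'Duration': ['48hr', '48hr', '72hr', '72hr'],
    'Method':   ['Original', 'dPFs', 'dPFs', 'dPFs'],
    'Status':   ['Upregulated', 'Upregulated', 'Upregulated', 'Downregulated'],
    'source':   ['GO:BP', 'GO:BP', 'GO:BP', 'GO:MF'],
    'name':     ['term_a', 'term_b', 'term_c', 'term_d'],
})

# Count enriched terms per (Duration, Method) row and (source, Status) column.
counts = df.pivot_table(index=['Duration', 'Method'],
                        columns=['source', 'Status'],
                        values='name', aggfunc='count', fill_value=0)
print(counts)
```

Everything after this in the cell below is styling on top of that count matrix.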

from matplotlib import cm
import matplotlib.patches as mpatches
import matplotlib.colors as mcolors

# --- Prepare the heatmap data ---
heatmap_df = subset_gp_res.pivot_table(
    index=['Duration', 'Method'],
    columns=['source', 'Status'],
    values='name',
    aggfunc='count',
    fill_value=0
)

# --- Set up the figure and axes ---
fig, ax = plt.subplots(figsize=(14, 5))

# --- Draw the heatmap ---
sns.heatmap(
    heatmap_df,
    cmap="Blues",
    annot=True,
    fmt=".0f",
    linewidths=0.7,
    linecolor='lightgrey',
    square=False,
    ax=ax,
    cbar_kws={
        "label": "Number of Enriched Terms",
        "orientation": "horizontal",
        "shrink": 0.7,
        "pad": 0.12
    },
    annot_kws={"fontsize": 12, "fontweight": "bold",},
    mask=heatmap_df.isnull(),
    vmin=0
)

# --- Axis and tick styling ---
ax.set_xlabel("Enrichment Source & Status", fontsize=16, fontweight="bold", labelpad=16)
ax.set_ylabel("Duration & Method", fontsize=16, fontweight="bold", labelpad=16)
ax.set_title(
    "Pathway Enrichment Results by Method, Duration, Source, and Status",
    fontsize=20, fontweight="bold", pad=22, color="#222222"
)

# --- Improve tick labels ---
xticklabels = [f"{source}\n{status}" for source, status in heatmap_df.columns]
ax.set_xticklabels(xticklabels, rotation=35, ha='right', fontsize=13, fontweight='semibold', color="#333333")
yticklabels = [f"{duration}\n{method}" for duration, method in heatmap_df.index]
ax.set_yticklabels(yticklabels, rotation=0, fontsize=14, fontweight='semibold', color="#333333")

# --- Add gridlines for better separation ---
ax.hlines(
    [i for i in range(1, len(heatmap_df.index))],
    *ax.get_xlim(),
    colors='grey', linestyles='dashed', linewidth=0.5, alpha=0.25
)
ax.vlines(
    [i for i in range(1, len(heatmap_df.columns))],
    *ax.get_ylim(),
    colors='grey', linestyles='dashed', linewidth=0.5, alpha=0.25
)

# --- Add value labels for zero cells (optional, faded) ---
for (i, j), val in np.ndenumerate(heatmap_df.values):
    if val == 0:
        ax.text(
            j + 0.5, i + 0.5, "0",
            ha='center', va='center',
            fontsize=11, color="#bbbbbb", alpha=0.7
        )

# --- Add extra annotation for total terms per row/column (optional) ---
row_totals = heatmap_df.sum(axis=1)
for i, total in enumerate(row_totals):
    ax.text(
        len(heatmap_df.columns) + 0.3, i + 0.5, f"Σ={total}",
        va='center', ha='left', fontsize=13, color="#444444", fontweight='bold'
    )
col_totals = heatmap_df.sum(axis=0)
for j, total in enumerate(col_totals):
    ax.text(
        j + 0.5, len(heatmap_df.index) + 0.25, f"{total}", 
        ha='center', va='bottom', fontsize=12, color="#444444", fontweight='bold'
    )

# --- Add color rectangles for Duration (row side) and Source (column side) ---
# Use dura_colors for Duration
duration_list = heatmap_df.index.get_level_values('Duration').unique()
duration_colors = dura_colors

source_list = heatmap_df.columns.get_level_values('source').unique()
source_colors = dict(zip(source_list, cm.Set2.colors[:len(source_list)]))

# Row-side rectangles for Duration
for i, (duration, method) in enumerate(heatmap_df.index):
    rect = mpatches.Rectangle(
        (-0.9, i), 0.18, 1,  # x, y, width, height
        facecolor=duration_colors[duration],
        edgecolor='none',
        transform=ax.transData,
        clip_on=False
    )
    ax.add_patch(rect)

# Column-side rectangles for Source
for j, (source, status) in enumerate(heatmap_df.columns):
    rect = mpatches.Rectangle(
        (j, len(heatmap_df.index) + 0.08), 1, 0.18,  # x, y, width, height
        facecolor=source_colors[source],
        edgecolor='none',
        transform=ax.transData,
        clip_on=False
    )
    ax.add_patch(rect)

# --- Add color rectangles for Status (column side, above heatmap) ---
status_colors_map = status_colors  # already defined

# Get the Status for each column in heatmap_df
status_list = heatmap_df.columns.get_level_values('Status')

for j, status in enumerate(status_list):
    rect = mpatches.Rectangle(
        (j, len(heatmap_df.index) + 0.28), 1, 0.18,  # x, y, width, height (above the source rectangles)
        facecolor=status_colors_map.get(status, "#cccccc"),
        edgecolor='none',
        transform=ax.transData,
        clip_on=False
    )
    ax.add_patch(rect)


# Add legends for Duration and Source, with more spacing
duration_handles = [mpatches.Patch(color=duration_colors[d], label=d) for d in duration_list]
source_handles = [mpatches.Patch(color=source_colors[s], label=s) for s in source_list]
legend1 = ax.legend(handles=duration_handles, title="Duration", loc='upper left', bbox_to_anchor=(1.03, 1), fontsize=13, title_fontsize=14, borderaxespad=1)
legend2 = ax.legend(handles=source_handles, title="Source", loc='upper left', bbox_to_anchor=(1.03, 0.7), fontsize=13, title_fontsize=14, borderaxespad=1)
ax.add_artist(legend1)

# --- Tight layout and finalize ---
plt.subplots_adjust(left=0.18, right=0.82, top=0.92, bottom=0.13)
plt.tight_layout(rect=[0, 0, 0.97, 1])
plots.finalize_plot(
    fig, 
    show=True,
    save=True,
    filename=f"Enrichment_Heatmap_AllSets",
    filepath=figure_path,
    formats=figure_formats, 
    transparent=transparent_bg,
    dpi=figure_dpi,
)

The figure effectively displays how the dPFs method expands the biological insight for each protein status, timepoint, and source combination by increasing the number of enriched GO terms detected.

05.3.2 Linear Flow Diagram

The linear flow diagram going from Original to dPFs method for each timepoint and status shows how many new GO terms are gained or lost when moving to proteoform-aware quantification. This highlights the impact of method choice on biological interpretations.
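The flow widths in the diagram are just transition counts between the two status columns. On a toy table (statuses made up), the counting that `plot_linear_flow` performs per method pair reduces to a `groupby`/`unstack`:

```python
import pandas as pd

# Toy table: one row per GO term, with its status under each method.
df = pd.DataFrame({
    'Original': ['Equivalent', 'Equivalent', 'Upregulated', 'Not Enriched'],
    'dPFs':     ['Upregulated', 'Downregulated', 'Upregulated', 'Upregulated'],
})

# Count Original -> dPFs transitions; each cell sizes one flow polygon.
flows = df.groupby(['Original', 'dPFs']).size().unstack(fill_value=0)
print(flows)
```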

# --- Reshape the data ---
# Pivot the table to get a 'Status' column for each method
# If there are multiple statuses for a term/method (e.g., multiple dPFs), keep all as separate rows
status_pivot_df = (
    subset_gp_res
    .pivot_table(
        index=['name', 'Duration', 'source', 'native'],
        columns='Method',
        values='Status',
        aggfunc=lambda x: list(x) if len(x) > 1 else x.iloc[0] if len(x) == 1 else np.nan
    )
    .reset_index()
)

# Explode any columns that have lists (i.e., multiple statuses for a method)
for col in status_pivot_df.columns:
    if col not in ['name', 'Duration', 'source', 'native']:
        if status_pivot_df[col].apply(lambda x: isinstance(x, list)).any():
            status_pivot_df = status_pivot_df.explode(col, ignore_index=True)

# Fill missing values with 'Not Enriched'
status_pivot_df = status_pivot_df.fillna('Not Enriched')

# --- Isolate the specific changes of interest ---
# Find terms that were 'Equivalent' in the 'Original' method
# but became 'Upregulated' or 'Downregulated' in the 'dPFs' method.
specific_changes_df = status_pivot_df[
    (status_pivot_df['Original'] == 'Equivalent') &
    (status_pivot_df['dPFs'].isin(['Upregulated', 'Downregulated']))
]

print("Example of pivoted data:")
print(status_pivot_df.head())
print("\nExample of isolated changes:")
print(specific_changes_df.head())
Example of pivoted data:
Method                                           name Duration source      native      Original          dPFs
0       2-oxoglutarate-dependent dioxygenase activity     48hr  GO:MF  GO:0016706   Upregulated   Upregulated
1       2-oxoglutarate-dependent dioxygenase activity     72hr  GO:MF  GO:0016706  Not Enriched   Upregulated
2                   6-phosphofructo-2-kinase activity     48hr  GO:MF  GO:0003873   Upregulated  Not Enriched
3                               ADP metabolic process     48hr  GO:BP  GO:0046031   Upregulated  Not Enriched
4                               ADP metabolic process     72hr  GO:BP  GO:0046031  Not Enriched   Upregulated

Example of isolated changes:
Method                     name Duration source      native    Original           dPFs
8                   ATP binding     48hr  GO:MF  GO:0005524  Equivalent  Downregulated
9                   ATP binding     72hr  GO:MF  GO:0005524  Equivalent    Upregulated
10                  ATP binding     72hr  GO:MF  GO:0005524  Equivalent  Downregulated
13      ATP hydrolysis activity     48hr  GO:MF  GO:0016887  Equivalent  Downregulated
14      ATP hydrolysis activity     72hr  GO:MF  GO:0016887  Equivalent    Upregulated
def plot_linear_flow(ax, data, methods, colors, status_order, add_legend=False):
    """
    Draws a sequential flow chart for a given list of methods on a matplotlib axes.
    """
    num_methods = len(methods)
    
    # --- Pre-calculate totals for all bars ---
    bar_totals = {}
    for method in methods:
        bar_totals[method] = data[method].value_counts().reindex(status_order, fill_value=0)

    # --- Draw all stacked bars with annotations ---
    for i, method in enumerate(methods):
        bottom = 0
        totals = bar_totals[method]
        for status in status_order:
            height = totals.get(status, 0)
            label = status if add_legend and i == 0 else None # Add legend info only once
            ax.bar(i + 1, height, bottom=bottom, color=colors[status], width=0.5, label=label, alpha=0.9, edgecolor='black', linewidth=1)
            if height > 5: # Annotation threshold
                ax.text(i + 1, bottom + height / 2, int(height), ha='center', va='center', color='white', fontsize=12, fontweight='bold')
            bottom += height

    # --- Draw all connecting flow polygons ---
    for i in range(num_methods - 1):
        method1, method2 = methods[i], methods[i+1]
        
        flow_counts = data.groupby([method1, method2]).size().unstack(fill_value=0)
        flow_counts = flow_counts.reindex(index=status_order, columns=status_order, fill_value=0)
        
        left_totals = bar_totals[method1]
        right_totals = bar_totals[method2]
        
        left_bottoms = left_totals.cumsum() - left_totals
        right_bottoms = right_totals.cumsum() - right_totals
        
        for left_status in status_order:
            for right_status in status_order:
                count = flow_counts.at[left_status, right_status]
                if count > 0:
                    y_left_start = left_bottoms[left_status]
                    y_right_start = right_bottoms[right_status]
                    polygon = plt.Polygon([
                        [i + 1.25, y_left_start], [i + 1.75, y_right_start],
                        [i + 1.75, y_right_start + count], [i + 1.25, y_left_start + count]
                    ], facecolor=colors[left_status], alpha=0.35, edgecolor='none')
                    ax.add_patch(polygon)
                    left_bottoms[left_status] += count
                    right_bottoms[right_status] += count

# --- Main Plotting Setup ---
durations = ['48hr', '72hr']
methods_to_plot = ['Original', 'dPFs']
status_order = ['Upregulated', 'Downregulated', 'Equivalent', 'Not Enriched']
status_colors = {
    'Not Enriched': "#C2C0C0",
    'Upregulated': '#780000',
    'Downregulated': '#e36414',
    'Equivalent': '#003049',
}

# Create the subplot grid (single column)
fig, axes = plt.subplots(
    nrows=len(durations),
    ncols=1,
    figsize=(5, 10),
    sharex=True,
    sharey=True, 
    gridspec_kw={'hspace': 0.05}
)
# Ensure axes is always an array
if len(durations) == 1:
    axes = [axes]

# --- Iterate through durations and plot ---
for i, duration in enumerate(durations):
    ax = axes[i]
    flow_data = status_pivot_df[status_pivot_df['Duration'] == duration]
    
    # Call the main plotting function
    plot_linear_flow(ax, flow_data, methods_to_plot, status_colors, status_order, add_legend=(i==0))

    # --- Styling and Labeling ---
    ax.spines[['top', 'right']].set_visible(False)
    ax.set_ylabel(f'Duration: {duration}\n\nNumber of Terms', fontsize=12, fontweight='bold')

# --- Final Figure-Level Adjustments ---
# Set shared X-axis labels
plt.xticks(ticks=range(1, len(methods_to_plot) + 1), labels=methods_to_plot, fontsize=15)

# Create a single, shared legend
handles, labels = axes[0].get_legend_handles_labels()


fig.suptitle("Linear Flow of Term Status Across All Methods", fontsize=18, fontweight='bold')

# Adjust layout to make room for the legend and prevent overlap
fig.subplots_adjust(left=0.1, right=0.88, top=0.92, bottom=0.15, hspace=0.2)

# Place the legend to the right of the subplots
fig.legend(handles, labels, loc='center right', fontsize=12, title='Term Status', title_fontsize=14, borderaxespad=1)

plots.finalize_plot(
    fig, 
    show=True,
    save=True,
    filename="LinearFlow_AllMethods_AllDurations",
    filepath=figure_path,
    formats=figure_formats, 
    transparent=transparent_bg,
    dpi=figure_dpi,
)

The enrichment results by term status at each timepoint (facet) show that the majority of terms from the statistically equivalent sets of the Original method become up- or downregulated with the dPFs method, demonstrating the added biological insight gained from proteoform-aware quantification. It also shows that, with added proteoform resolution, more specific biological processes can be uncovered that are missed at the protein level.
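The status transitions driving the flow chart above can also be tallied directly with a crosstab; a minimal sketch on synthetic data standing in for `status_pivot_df`:

```python
import pandas as pd

# Synthetic stand-in for status_pivot_df: tally how term statuses
# move from the Original method to dPFs.
df = pd.DataFrame({
    "Original": ["Equivalent", "Equivalent", "Upregulated", "Not Enriched"],
    "dPFs": ["Upregulated", "Downregulated", "Upregulated", "Upregulated"],
})
transitions = pd.crosstab(df["Original"], df["dPFs"])
print(transitions)
```

Each cell of `transitions` is the number of terms flowing from one status (rows) to another (columns), the same quantities encoded by the polygon widths in the plot.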

05.4 Examining in Detail for Selected Groups/Categories

In this part, I will be focusing on uncovering the groups and categories within the enriched GO terms that are particularly interesting or relevant to hypoxia biology. This may involve looking for terms related to known hypoxia pathways, cellular stress responses, metabolic changes, or other processes implicated in the cellular response to low oxygen conditions.

Examining the Top MF and BP Terms

duplicated_names = status_pivot_df.groupby('name')['Duration'].nunique()
duplicated_names = duplicated_names[duplicated_names == 2].index.tolist()
print(f"Terms appearing in both durations: {len(duplicated_names)}")
print()
print("GO:MF Terms:")
utils.view_table(
    status_pivot_df[
        (status_pivot_df['name'].isin(duplicated_names)) &
        (status_pivot_df['source'] == 'GO:MF')
    ].sort_values(['name', 'Duration']),
    page_number=1
)
print()
print("GO:BP Terms:")
utils.view_table(
        status_pivot_df[
        (status_pivot_df['name'].isin(duplicated_names)) &
        (status_pivot_df['source'] == 'GO:BP')
    ].sort_values(['name', 'Duration']),
    page_number=1
)
Terms appearing in both durations: 387

GO:MF Terms:
Total pages: 12, Current page: 1
Method name Duration source native Original dPFs
0 2-oxoglutarate-dependent dioxygenase activity 48hr GO:MF GO:0016706 Upregulated Upregulated
1 2-oxoglutarate-dependent dioxygenase activity 72hr GO:MF GO:0016706 Not Enriched Upregulated
7 ATP binding 48hr GO:MF GO:0005524 Equivalent Equivalent
8 ATP binding 48hr GO:MF GO:0005524 Equivalent Downregulated
9 ATP binding 72hr GO:MF GO:0005524 Equivalent Upregulated
10 ATP binding 72hr GO:MF GO:0005524 Equivalent Downregulated
11 ATP binding 72hr GO:MF GO:0005524 Equivalent Equivalent
12 ATP hydrolysis activity 48hr GO:MF GO:0016887 Equivalent Equivalent
13 ATP hydrolysis activity 48hr GO:MF GO:0016887 Equivalent Downregulated
14 ATP hydrolysis activity 72hr GO:MF GO:0016887 Equivalent Upregulated
15 ATP hydrolysis activity 72hr GO:MF GO:0016887 Equivalent Downregulated
16 ATP hydrolysis activity 72hr GO:MF GO:0016887 Equivalent Equivalent
17 ATP-dependent activity 48hr GO:MF GO:0140657 Equivalent Equivalent
18 ATP-dependent activity 48hr GO:MF GO:0140657 Equivalent Downregulated
19 ATP-dependent activity 72hr GO:MF GO:0140657 Equivalent Downregulated
20 ATP-dependent activity 72hr GO:MF GO:0140657 Equivalent Upregulated
21 ATP-dependent activity 72hr GO:MF GO:0140657 Equivalent Equivalent
22 ATP-dependent activity, acting on DNA 48hr GO:MF GO:0008094 Not Enriched Equivalent
23 ATP-dependent activity, acting on DNA 72hr GO:MF GO:0008094 Equivalent Upregulated
24 ATP-dependent activity, acting on DNA 72hr GO:MF GO:0008094 Equivalent Equivalent
25 ATP-dependent activity, acting on RNA 48hr GO:MF GO:0008186 Equivalent Equivalent
26 ATP-dependent activity, acting on RNA 72hr GO:MF GO:0008186 Not Enriched Upregulated
27 ATP-dependent activity, acting on RNA 72hr GO:MF GO:0008186 Not Enriched Downregulated
28 ATP-dependent chromatin remodeler activity 48hr GO:MF GO:0140658 Not Enriched Equivalent
29 ATP-dependent chromatin remodeler activity 72hr GO:MF GO:0140658 Equivalent Upregulated
GO:BP Terms:
Total pages: 23, Current page: 1
Method name Duration source native Original dPFs
3 ADP metabolic process 48hr GO:BP GO:0046031 Upregulated Not Enriched
4 ADP metabolic process 72hr GO:BP GO:0046031 Not Enriched Upregulated
37 DNA damage response 48hr GO:BP GO:0006974 Equivalent Equivalent
38 DNA damage response 72hr GO:BP GO:0006974 Equivalent Downregulated
39 DNA metabolic process 48hr GO:BP GO:0006259 Equivalent Equivalent
40 DNA metabolic process 72hr GO:BP GO:0006259 Not Enriched Downregulated
41 DNA repair 48hr GO:BP GO:0006281 Not Enriched Equivalent
42 DNA repair 72hr GO:BP GO:0006281 Equivalent Downregulated
43 DNA repair 72hr GO:BP GO:0006281 Equivalent Equivalent
45 DNA-templated transcription elongation 48hr GO:BP GO:0006354 Equivalent Equivalent
46 DNA-templated transcription elongation 72hr GO:BP GO:0006354 Equivalent Not Enriched
78 Golgi to plasma membrane transport 48hr GO:BP GO:0006893 Not Enriched Equivalent
79 Golgi to plasma membrane transport 72hr GO:BP GO:0006893 Not Enriched Equivalent
80 Golgi vesicle transport 48hr GO:BP GO:0048193 Equivalent Equivalent
81 Golgi vesicle transport 72hr GO:BP GO:0048193 Equivalent Equivalent
99 RNA export from nucleus 48hr GO:BP GO:0006405 Equivalent Equivalent
100 RNA export from nucleus 72hr GO:BP GO:0006405 Equivalent Equivalent
104 RNA localization 48hr GO:BP GO:0006403 Equivalent Equivalent
105 RNA localization 72hr GO:BP GO:0006403 Equivalent Equivalent
106 RNA localization 72hr GO:BP GO:0006403 Equivalent Downregulated
109 RNA processing 48hr GO:BP GO:0006396 Equivalent Equivalent
110 RNA processing 72hr GO:BP GO:0006396 Equivalent Equivalent
111 RNA splicing 48hr GO:BP GO:0008380 Equivalent Equivalent
112 RNA splicing 72hr GO:BP GO:0008380 Equivalent Equivalent
113 RNA splicing, via transesterification reactions 48hr GO:BP GO:0000375 Equivalent Equivalent

There are many cases where the same term is enriched at more than one status for a given timepoint. This is a particularly interesting case: as shown before, dPFs can capture distinct proteoform-level regulation that is missed at the protein level. Protein-level quantification cannot represent a protein in two different regulatory states at the same timepoint, but with dPFs we can capture that.
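The groupby/filter idiom used in the next cell to isolate these conflicting terms can be sketched on synthetic data:

```python
import pandas as pd

# Keep only terms that appear with more than one Status
# (synthetic data; the real cell groups by name, Duration, and Method).
df = pd.DataFrame({
    "name": ["ATP binding", "ATP binding", "DNA repair"],
    "Status": ["Upregulated", "Downregulated", "Equivalent"],
})
conflicts = df.groupby("name").filter(lambda g: g["Status"].nunique() > 1)
print(conflicts)
```

`filter` drops whole groups whose predicate is False, so single-status terms are removed while every row of a conflicting term is retained.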

# Identify cases where same Duration-Method is repeated with different Status
conflicts = subset_gp_res.groupby(['name', 'Duration', 'Method']).filter(lambda x: x['Status'].nunique() > 1)
conflicts = conflicts.sort_values(['name', 'Duration', 'Method'])
utils.view_table(conflicts[conflicts['source']=='GO:BP'], page_number=11)
Total pages: 11, Current page: 11
query source native name significant p_value p_capped -log10(p_value) -log10(p_capped) Enrichment GeneRatio parents Duration Method Status
868 72hr_dPFs_Downregulated GO:BP GO:0051056 regulation of small GTPase mediated signal tra... True 0.0001 0.0001 3.9388 3.9388 2.4640 0.1762 ['GO:0007264', 'GO:1902531'] 72hr dPFs Downregulated
2093 72hr_dPFs_Equivalent GO:BP GO:0051056 regulation of small GTPase mediated signal tra... True 0.0089 0.0089 2.0518 2.0518 1.7228 0.2280 ['GO:0007264', 'GO:1902531'] 72hr dPFs Equivalent
1591 48hr_dPFs_Equivalent GO:BP GO:1902903 regulation of supramolecular fiber organization True 0.0032 0.0032 2.4950 2.4950 1.4594 0.3281 ['GO:0051128', 'GO:0097435'] 48hr dPFs Equivalent
1897 48hr_dPFs_Upregulated GO:BP GO:1902903 regulation of supramolecular fiber organization True 0.0063 0.0063 2.2004 2.2004 2.5169 0.0820 ['GO:0051128', 'GO:0097435'] 48hr dPFs Upregulated
803 72hr_dPFs_Equivalent GO:BP GO:1902903 regulation of supramolecular fiber organization True 0.0001 0.0001 4.1720 4.1720 1.8597 0.2461 ['GO:0051128', 'GO:0097435'] 72hr dPFs Equivalent
1975 72hr_dPFs_Downregulated GO:BP GO:1902903 regulation of supramolecular fiber organization True 0.0072 0.0072 2.1425 2.1425 1.9123 0.1367 ['GO:0051128', 'GO:0097435'] 72hr dPFs Downregulated
1157 72hr_dPFs_Downregulated GO:BP GO:0007264 small GTPase-mediated signal transduction True 0.0006 0.0006 3.1996 3.1996 1.9918 0.1424 ['GO:0141124'] 72hr dPFs Downregulated
2042 72hr_dPFs_Equivalent GO:BP GO:0007264 small GTPase-mediated signal transduction True 0.0081 0.0081 2.0911 2.0911 1.5544 0.2057 ['GO:0141124'] 72hr dPFs Equivalent
1411 72hr_dPFs_Upregulated GO:BP GO:0051653 spindle localization True 0.0018 0.0018 2.7443 2.7443 3.9755 0.2653 ['GO:0022402', 'GO:0051640'] 72hr dPFs Upregulated
1451 72hr_dPFs_Downregulated GO:BP GO:0051653 spindle localization True 0.0020 0.0020 2.6907 2.6907 3.7108 0.2653 ['GO:0022402', 'GO:0051640'] 72hr dPFs Downregulated
1177 72hr_dPFs_Equivalent GO:BP GO:0097435 supramolecular fiber organization True 0.0007 0.0007 3.1540 3.1540 1.5028 0.1989 ['GO:0016043'] 72hr dPFs Equivalent
1415 72hr_dPFs_Downregulated GO:BP GO:0097435 supramolecular fiber organization True 0.0018 0.0018 2.7435 2.7435 1.6795 0.1201 ['GO:0016043'] 72hr dPFs Downregulated
1734 72hr_dPFs_Upregulated GO:BP GO:0097435 supramolecular fiber organization True 0.0044 0.0044 2.3567 2.3567 1.6868 0.1126 ['GO:0016043'] 72hr dPFs Upregulated
823 48hr_dPFs_Upregulated GO:BP GO:0048731 system development True 0.0001 0.0001 4.0909 4.0909 1.6123 0.0525 ['GO:0007275', 'GO:0048856'] 48hr dPFs Upregulated
2071 48hr_dPFs_Downregulated GO:BP GO:0048731 system development True 0.0085 0.0085 2.0711 2.0711 1.4610 0.0467 ['GO:0007275', 'GO:0048856'] 48hr dPFs Downregulated
444 72hr_dPFs_Upregulated GO:BP GO:0048731 system development True 0.0000 0.0000 6.0431 6.0431 1.4794 0.0987 ['GO:0007275', 'GO:0048856'] 72hr dPFs Upregulated
1896 72hr_dPFs_Downregulated GO:BP GO:0048731 system development True 0.0063 0.0063 2.2006 2.2006 1.2844 0.0918 ['GO:0007275', 'GO:0048856'] 72hr dPFs Downregulated
utils.view_table(
        status_pivot_df[
        (status_pivot_df['name'].isin(duplicated_names)) &
        (status_pivot_df['source'] == 'GO:BP')
    ].sort_values(['name', 'Duration']),
    page_number=5
)
Total pages: 23, Current page: 5
Method name Duration source native Original dPFs
339 cellular component organization 72hr GO:BP GO:0016043 Equivalent Upregulated
340 cellular component organization or biogenesis 48hr GO:BP GO:0071840 Equivalent Equivalent
341 cellular component organization or biogenesis 48hr GO:BP GO:0071840 Equivalent Downregulated
342 cellular component organization or biogenesis 48hr GO:BP GO:0071840 Equivalent Upregulated
343 cellular component organization or biogenesis 72hr GO:BP GO:0071840 Equivalent Downregulated
344 cellular component organization or biogenesis 72hr GO:BP GO:0071840 Equivalent Equivalent
345 cellular component organization or biogenesis 72hr GO:BP GO:0071840 Equivalent Upregulated
346 cellular developmental process 48hr GO:BP GO:0048869 Equivalent Equivalent
347 cellular developmental process 48hr GO:BP GO:0048869 Equivalent Upregulated
348 cellular developmental process 48hr GO:BP GO:0048869 Equivalent Downregulated
349 cellular developmental process 72hr GO:BP GO:0048869 Not Enriched Upregulated
350 cellular developmental process 72hr GO:BP GO:0048869 Not Enriched Downregulated
351 cellular localization 48hr GO:BP GO:0051641 Equivalent Equivalent
352 cellular localization 48hr GO:BP GO:0051641 Equivalent Downregulated
353 cellular localization 72hr GO:BP GO:0051641 Equivalent Equivalent
354 cellular localization 72hr GO:BP GO:0051641 Equivalent Downregulated
355 cellular localization 72hr GO:BP GO:0051641 Equivalent Upregulated
356 cellular macromolecule localization 48hr GO:BP GO:0070727 Equivalent Equivalent
357 cellular macromolecule localization 48hr GO:BP GO:0070727 Equivalent Downregulated
358 cellular macromolecule localization 72hr GO:BP GO:0070727 Equivalent Equivalent
359 cellular macromolecule localization 72hr GO:BP GO:0070727 Equivalent Downregulated
360 cellular macromolecule localization 72hr GO:BP GO:0070727 Equivalent Upregulated
361 cellular process 48hr GO:BP GO:0009987 Equivalent Equivalent
362 cellular process 48hr GO:BP GO:0009987 Equivalent Downregulated
363 cellular process 72hr GO:BP GO:0009987 Equivalent Equivalent

The previous two table views are paged and can be scrolled within the notebook. I have examined them and categorized the terms based on their relevance to hypoxia biology, as well as other themes, in the dictionaries below.


# --- Curated Dictionaries of GO:BP Terms ---
# This file contains several 'terms_of_interest' dictionaries, each curated
# to highlight different biological themes based on the enrichment data.

# ==============================================================================
# DICTIONARY 1: High-Level Biological Themes (Similar to the plot)
# A balanced overview of major processes. Good for a general summary.
# ==============================================================================
terms_of_interest_high_level = {
    "Response to Stimulus & Hypoxia": [
        "response to hypoxia",
        "cellular response to stimulus",
        "DNA damage response",
        "response to stress"
    ],
    "Cell Cycle & Chromosome Dynamics": [
        "mitotic cell cycle process",
        "chromosome segregation",
        "cell division",
        "regulation of cell cycle"
    ],
    "Gene Expression & RNA Metabolism": [
        "gene expression",
        "mRNA processing",
        "RNA splicing",
        "chromatin organization"
    ],
    "Cellular Architecture & Morphogenesis": [
        "cytoskeleton organization",
        "organelle organization",
        "anatomical structure morphogenesis",
        "cell projection organization"
    ],
    "Localization & Transport": [
        "protein localization",
        "vesicle-mediated transport",
        "nucleocytoplasmic transport",
        "macromolecule localization"
    ],
    "Biological Regulation & Signaling": [
        "positive regulation of biological process",
        "negative regulation of biological process",
        "small GTPase mediated signal transduction",
        "regulation of signaling"
    ]
}

# ==============================================================================
# DICTIONARY 2: Focus on Protein Lifecycle & Metabolism
# Tracks processes from protein creation to breakdown.
# ==============================================================================
terms_of_interest_protein_focus = {
    "Protein Synthesis & Modification": [
        "translation",
        "protein modification process",
        "protein phosphorylation",
        "ribosome biogenesis"
    ],
    "Protein Transport & Localization": [
        "protein localization",
        "protein transport",
        "intracellular protein transport",
        "establishment of protein localization"
    ],
    "Protein Complex Assembly": [
        "protein-containing complex assembly",
        "protein polymerization",
        "supramolecular fiber organization",
        "regulation of protein-containing complex assembly"
    ],
    "Protein Catabolism & Breakdown": [
        "protein catabolic process",
        "proteasomal protein catabolic process",
        "ubiquitin-dependent protein catabolic process",
        "proteolysis"
    ]
}

# ==============================================================================
# DICTIONARY 3: Focus on Developmental Processes
# Highlights terms related to development, differentiation, and morphogenesis.
# ==============================================================================
terms_of_interest_development_focus = {
    "System & Tissue Development": [
        "multicellular organism development",
        "anatomical structure development",
        "nervous system development",
        "tissue development"
    ],
    "Cellular Development & Differentiation": [
        "cellular developmental process",
        "cell differentiation",
        "neurogenesis",
        "generation of neurons"
    ],
    "Morphogenesis & Patterning": [
        "anatomical structure morphogenesis",
        "cell morphogenesis",
        "cell projection organization",
        "establishment or maintenance of cell polarity"
    ]
}

# ==============================================================================
# DICTIONARY 4: Focus on Intracellular Signaling & Communication
# Centers on how cells communicate and respond to signals.
# ==============================================================================
terms_of_interest_signaling_focus = {
    "Signal Transduction Pathways": [
        "signal transduction",
        "small GTPase mediated signal transduction",
        "intracellular signal transduction",
        "regulation of signaling"
    ],
    "Cell Communication": [
        "cell communication",
        "cell-cell adhesion",
        "regulation of cell communication"
    ],
    "Transport & Vesicles in Signaling": [
        "vesicle-mediated transport",
        "endocytosis",
        "Golgi vesicle transport",
        "regulation of localization"
    ]
}

# ==============================================================================
# DICTIONARY 5: Data-Driven & Interest-Focused Selection
# Curated based on specific phenomena of interest (hypoxia, metabolism) and
# data-driven criteria to compare methods.
# ==============================================================================
terms_of_interest_data_driven = {
    "Hypoxia Response": [
        'response to hypoxia',
        'response to decreased oxygen levels',
        'response to oxygen levels',
        'cellular response to hypoxia',
    ],
    "Energy & Carbohydrate Metabolism": [
        'hexose metabolic process',
        'glucose metabolic process',
        'carbohydrate metabolic process',
        'ADP metabolic process',
    ],
    "Cell Proliferation & Repair": [
        'cell cycle',
        'cell division',
        'chromosome segregation',
        'DNA damage response',
        'cell differentiation',
        'cell morphogenesis'
    ],
    "Cellular Component Dynamics": [
        'cellular component assembly',
        'cellular component organization',
        'cellular component biogenesis',
        'cellular component disassembly',
    ],
    "Gene Expression": [
        "gene expression",
        "RNA processing",
        "nucleic acid metabolic process",
        "macromolecule biosynthetic process"
    ],
    "Signaling": [
        # Selected to highlight where 'dPFs' shows enrichment (Up/Down/Equiv)
        # while the 'Original' method shows 'Not Enriched'.
        "regulation of small GTPase mediated signal transduction",
        "regulation of protein catabolic process",
        "regulation of cell cycle process",
        "negative regulation of gene expression"
    ]
}


# ===============================================================================
# DICTIONARY 6: Custom Terms to Include/Exclude
# A flexible dictionary to manually include or exclude specific terms.
# This can be used to tailor the analysis to specific research questions.
# ===============================================================================
# A dictionary categorizing the GO terms into logical biological groups.
custom_terms_of_interest = {
    'Response to Stimulus': [
        'response to stress',
        'response to stimulus',
        'cellular response to stimulus',
        'response to hypoxia',
        'response to decreased oxygen levels',
        'response to oxygen levels',
        'cellular response to hypoxia',
        'cellular response to decreased oxygen levels',
    ],
    'Metabolism': [
        'hexose metabolic process',
        'glucose metabolic process',
        'carbohydrate metabolic process',
        'ADP metabolic process',

        'mRNA metabolic process',
        'RNA processing',
        'mRNA processing',
    ],
    'Cell Cycle': [
        'cell cycle',
        'cell division',
        'chromosome segregation',
        'DNA damage response',
        'DNA repair',
    ],
    'Cellular Organization': [
        'cellular process',
        'cell development',
        'cell differentiation',
        'cell morphogenesis',
        'cellular component assembly',
        'cellular component organization',
        'cellular component biogenesis',
        'cellular component disassembly',
    ],
    'Signaling': [
        'cell communication',
        'signal transduction',
        'signaling',
        'small GTPase-mediated signal transduction',
    ]
}

Custom Terms of Interest is the custom grouping that I wanted to highlight in a single figure.


These will be placed in a long-format, heatmap-like grid plot with points/pie charts showing whether a term is enriched in each method-duration-status combination. Most terms will be represented by a single status color, but some will show two or even three status colors, indicating that they are enriched at multiple statuses for a given timepoint.
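The grid layout is driven by flattening the category-to-terms dictionary into a fixed plotting order; a minimal sketch of the pattern, with hypothetical category names:

```python
# Flatten a category -> terms dict into a plotting order and y-position map,
# mirroring how term_order / y_map are built in the plotting cell.
terms_of_interest = {
    "Metabolism": ["glycolysis", "TCA cycle"],
    "Signaling": ["MAPK cascade"],
}
category_order = list(terms_of_interest.keys())[::-1]  # reversed for top-down axes
term_order = [t for c in category_order for t in terms_of_interest[c]]
y_map = {t: i for i, t in enumerate(term_order)}
print(y_map)
```

Reversing the category order compensates for the inverted y-axis, so the first category in the dictionary ends up at the top of the figure.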

from matplotlib.patches import Wedge

terms_of_interest = custom_terms_of_interest

methods_to_show = ['Original', 'dPFs']
durations_to_show = ['48hr', '72hr']
status_order = ['Upregulated', 'Downregulated', 'Equivalent', 'Not Enriched']
plot_terms = [term for category in terms_of_interest.values() for term in category]

df = subset_gp_res.copy()
if 'name' in df.columns:
    df = df.rename(columns={'name': 'Term'})

df['Duration'] = df['Duration'].astype(str)
df['Method'] = df['Method'].astype(str)

plot_df = df[df['Term'].isin(plot_terms)].copy()
plot_df['neglog10_p'] = -np.log10(plot_df['p_value'].clip(lower=1e-300))

agg_df = plot_df.groupby(['Term', 'Duration', 'Method']).agg(
    statuses=('Status', list),
    p_values=('neglog10_p', list)
).reset_index()

all_combos = pd.MultiIndex.from_product(
    [plot_terms, durations_to_show, methods_to_show],
    names=['Term', 'Duration', 'Method']
)
plot_full = agg_df.set_index(['Term', 'Duration', 'Method']).reindex(all_combos).reset_index()
plot_full['statuses'] = plot_full['statuses'].apply(lambda x: x if isinstance(x, list) else ['Not Enriched'])
plot_full['p_values'] = plot_full['p_values'].apply(lambda x: x if isinstance(x, list) else [0])

x_labels = [f"{d}_{m}" for d in durations_to_show for m in methods_to_show]
pos_map = {label: i for i, label in enumerate(x_labels)}
plot_full['x'] = plot_full.apply(lambda r: pos_map[f"{r['Duration']}_{r['Method']}"], axis=1)

category_order = list(terms_of_interest.keys())[::-1]
term_order = [term for category in category_order for term in terms_of_interest[category]]
y_map = {term: i for i, term in enumerate(term_order)}
plot_full['y'] = plot_full['Term'].map(y_map)

min_area = 20
size_scale = 40
all_p_vals = [item for sublist in plot_full['p_values'] for item in sublist if item > 0]
max_s_val = min_area + (max(all_p_vals) * size_scale) if all_p_vals else min_area

def s_to_radius(s_val, max_s):
    max_radius = 0.45
    return max_radius * np.sqrt(s_val / max_s) if max_s > 0 else 0

fig, ax = plt.subplots(figsize=(max(8, len(x_labels) * 1.5), max(7, len(term_order) * 0.6)))

# Draw grid
ax.set_axisbelow(True)
ax.grid(True, which='major', axis='both', color='#e0e0e0', linestyle='-', linewidth=0.7, zorder=0)

# Draw vertical lines to separate duration groups
for i in range(1, len(durations_to_show)):
    ax.axvline(i * len(methods_to_show) - 0.5, color='#d9d9d9', linestyle='--', linewidth=1.2, zorder=1)

for _, row in plot_full.iterrows():
    if pd.isna(row['y']): continue
    statuses = row['statuses']
    p_vals = row['p_values']
    n_statuses = len(statuses)
    x, y = row['x'], row['y']

    if statuses[0] == 'Not Enriched':
        ax.scatter(x, y, s=min_area, facecolors='none', edgecolors='#bbbbbb', linewidths=1.2, zorder=5)
        continue

    max_p_val = max(p_vals) if p_vals else 0
    s_val = min_area + max_p_val * size_scale
    radius = s_to_radius(s_val, max_s_val)
    status_p_val_pairs = sorted(zip(statuses, p_vals), key=lambda pair: status_order.index(pair[0]))

    angle_step = 360 / n_statuses
    for i, (status, p_val) in enumerate(status_p_val_pairs):
        start_angle = i * angle_step
        end_angle = (i + 1) * angle_step
        color = status_colors.get(status, '#cccccc')
        wedge = Wedge(center=(x, y), r=radius, theta1=start_angle, theta2=end_angle,
                      facecolor=color, edgecolor='#333333', linewidth=1.0, zorder=5, alpha=0.9)
        ax.add_patch(wedge)

# --- Formatting and Legends (with Categories) ---

# Y-axis ticks and labels (wrap long names)
import textwrap

def wrap_label(label, width=28):
    return "\n".join(textwrap.wrap(label, width=width))

yticks = list(y_map.values())
yticklabels = [wrap_label(term, width=28) for term in y_map.keys()]
ax.set_yticks(yticks)
ax.set_yticklabels(yticklabels, fontsize=12, va='center')

# Custom X-axis ticks
xticks = list(range(len(x_labels)))
xticklabels = [m for d in durations_to_show for m in methods_to_show]
ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels, fontsize=12, rotation=0)

# Add overarching duration labels
for i, d in enumerate(durations_to_show):
    xpos = i * len(methods_to_show) + 0.5 * (len(methods_to_show) - 1)
    ax.text(xpos, 1.07, d, ha='center', va='bottom', fontsize=13, fontweight='bold', transform=ax.get_xaxis_transform())

# Add category separators and labels, offset to the left of y-ticks
current_y = 0
category_label_x = -2.5  # further left for category labels
for category_name, terms in reversed(list(terms_of_interest.items())):
    num_terms = len(terms)
    if current_y > 0:
        ax.axhline(current_y - 0.5, color='#888888', linestyle='-', linewidth=1.5, zorder=2)
    label_y = current_y + (num_terms - 1) / 2.0
    # Place category label further left, rotated and centered vertically
    ax.text(
        category_label_x, label_y, category_name,
        ha='center', va='center', fontsize=13, fontweight='bold',
        rotation=90, wrap=True, transform=ax.transData, clip_on=False
    )
    current_y += num_terms

ax.set_xlim(-0.5, len(x_labels) - 0.5)
ax.set_ylim(len(term_order) - 0.5, -0.5)
ax.set_title("GO Term Enrichment Status by Category\n(Color = Status, Area ≈ -log10(p-value))", fontsize=16, fontweight='bold', pad=40)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.tick_params(axis='both', which='both', length=0)

# Status legend
status_elems = [plt.Line2D([0], [0], marker='o', color='w', label=s,
                           markerfacecolor=c, markeredgecolor='#333333', markersize=10)
                for s, c in status_colors.items() if s in status_order]
legend1 = ax.legend(handles=status_elems, title='Status', bbox_to_anchor=(1.04, 1), loc='upper left', frameon=False, fontsize=11)
ax.add_artist(legend1)

# Size legend
size_handles = []
p_vals_for_legend = [1, 5, 10, 20]
for p_val in p_vals_for_legend:
    s = min_area + p_val * size_scale
    size_handles.append(plt.scatter([], [], s=s, label=f'{p_val}', color='#777777', edgecolor='#333333'))

legend2 = ax.legend(handles=size_handles, title='Max Significance\n(-log10 p-value)',
                    bbox_to_anchor=(1.04, 0.6), loc='upper left', frameon=False,
                    labelspacing=2.5, fontsize=11)
legend2.get_title().set_fontsize(12)

plt.tight_layout(rect=[0.08, 0, 0.85, 1])
plt.subplots_adjust(left=0.32)  # More space for category labels

plots.finalize_plot(
    fig, 
    show=True,
    save=True,
    filename=f"GOBP_Term_Enrichment_By_Category_dataDriven",
    filepath=figure_path,
    formats=figure_formats, 
    transparent=transparent_bg,
    dpi=figure_dpi,
)

The custom terms of interest plot effectively summarizes how proteoform-aware quantification using dPFs reveals additional biological processes relevant to hypoxia response that are missed by traditional protein-level methods. The presence of terms across multiple statuses highlights the complex regulation captured at the proteoform level.
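To quantify the spread of terms across statuses mentioned above, a simple cross-tabulation of the enrichment table works; the sketch below uses a toy frame with the same columns (`Term`, `Method`, `Status`) rather than the real `subset_gp_res`, and all values are illustrative:

```python
import pandas as pd

# Toy enrichment table mimicking the columns used in the plots above;
# terms and statuses are illustrative, not real results.
toy = pd.DataFrame({
    "Term":   ["response to hypoxia", "response to hypoxia",
               "glycolytic process", "autophagy"],
    "Method": ["Original", "dPFs", "dPFs", "dPFs"],
    "Status": ["Shared", "Shared", "dPFs only", "dPFs only"],
})

# Count how many terms fall into each status per method
status_counts = pd.crosstab(toy["Method"], toy["Status"])
print(status_counts)
```

The same call on the real results frame gives a quick numeric companion to the dot plot.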

Another way to represent this information is a dumbbell plot showing the shared and method-unique status terms between methods at each timepoint.

import textwrap

terms_of_interest = custom_terms_of_interest

df = subset_gp_res.copy()
if 'name' in df.columns:
    df = df.rename(columns={'name': 'Term'})

# Filter for the terms and methods we want to plot
plot_terms = [term for category in terms_of_interest.values() for term in category]
methods_to_show = ['Original', 'dPFs']
durations_to_show = sorted(df['Duration'].astype(str).unique())

plot_df = df[
    df['Term'].isin(plot_terms) &
    df['Method'].isin(methods_to_show)
].copy()

# Calculate significance for point sizing
plot_df['neglog10_p'] = -np.log10(plot_df['p_value'].clip(lower=1e-50))

# --- 2. Plotting Setup ---
category_order = list(terms_of_interest.keys())[::-1]
term_order = [term for category in category_order for term in terms_of_interest[category]]
y_map = {term: i for i, term in enumerate(term_order)}
plot_df['y'] = plot_df['Term'].map(y_map)
plot_df = plot_df.dropna(subset=['y'])

fig, axes = plt.subplots(
    1, len(durations_to_show),
    figsize=(16, max(7, len(term_order) * 0.45)),
    sharey=True
)
if len(durations_to_show) == 1:
    axes = [axes]

method_markers = {'Original': 'o', 'dPFs': 's'}
size_scale = 20

# --- 3. Drawing the Plot ---
for i, duration in enumerate(durations_to_show):
    ax = axes[i]
    duration_df = plot_df[plot_df['Duration'] == duration].copy()

    # Draw the horizontal range lines for each term
    if not duration_df.empty:
        range_df = duration_df.groupby('y')['GeneRatio'].agg(['min', 'max']).reset_index()
        ax.hlines(y=range_df['y'], xmin=range_df['min'], xmax=range_df['max'],
                    color='#cccccc', alpha=0.8, linewidth=1.5, zorder=1)

    # Draw the scatter points for each enrichment result
    for _, row in duration_df.iterrows():
        ax.scatter(
            x=row['GeneRatio'],
            y=row['y'],
            s=row['neglog10_p'] * size_scale,
            color=status_colors.get(row['Status'], '#cccccc'),
            marker=method_markers.get(row['Method'], 'x'),
            zorder=2,
            edgecolor='black',
            linewidth=0.5,
            alpha=0.9
        )

    # --- 4. Formatting the Axes ---
    ax.set_title(duration, fontsize=14, fontweight='bold')
    current_y = 0
    for category_name, terms in reversed(list(terms_of_interest.items())):
        if current_y > 0:
            ax.axhline(current_y - 0.5, color='#888888', linestyle='-', linewidth=1.5, zorder=0)
        current_y += len(terms)

    ax.grid(True, which='major', axis='both', color='#e0e0e0', linestyle='--', zorder=0)
    ax.set_xlabel('GeneRatio', fontsize=12)
    ax.tick_params(axis='x', labelsize=11)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.set_xlim(left=0)

# --- 5. Global Formatting and Legends ---
def wrap_label(label, width=35):
    return "\n".join(textwrap.wrap(label, width=width))

yticks = list(y_map.values())
yticklabels = [wrap_label(term) for term in y_map.keys()]
axes[0].set_yticks(yticks)
axes[0].set_yticklabels(yticklabels, fontsize=11, va='center')
axes[0].tick_params(axis='y', length=0)

# Determine position for category labels
max_xlim = max(ax.get_xlim()[1] for ax in axes)
category_label_x = -0.1 * max_xlim

current_y = 0
for category_name, terms in reversed(list(terms_of_interest.items())):
    label_y = current_y + (len(terms) - 1) / 2.0
    axes[0].text(
        category_label_x, label_y, wrap_label(category_name, 15),
        ha='center', va='center', fontsize=12, fontweight='bold',
        rotation=90, clip_on=False
    )
    current_y += len(terms)

fig.suptitle('GO Term Enrichment Comparison by Status and Method', fontsize=18, fontweight='bold', y=0.99)

# Create legends
status_elems = [plt.Line2D([0], [0], marker='o', color='w', label=s, markerfacecolor=c, markersize=10, markeredgecolor='k')
                for s, c in status_colors.items()]
legend1 = fig.legend(handles=status_elems, title='Status', loc='upper right', bbox_to_anchor=(0.98, 0.94), fontsize=11)
legend1.get_title().set_fontweight('bold')

method_elems = [plt.Line2D([0], [0], marker=m, color='w', label=l, linestyle='None', markersize=10, markeredgecolor='k')
                for l, m in method_markers.items()]
legend2 = fig.legend(handles=method_elems, title='Method', loc='upper right', bbox_to_anchor=(0.98, 0.82), fontsize=11)
legend2.get_title().set_fontweight('bold')

p_vals_for_legend = [1, 5, 15]
size_handles = [plt.scatter([], [], s=p * size_scale, label=f'{p}', color='#777777', edgecolor='k') for p in p_vals_for_legend]
legend3 = fig.legend(handles=size_handles, title='-log10(p-value)', loc='upper right', bbox_to_anchor=(0.98, 0.70), labelspacing=1.5, fontsize=11)
legend3.get_title().set_fontweight('bold')

plt.tight_layout(rect=[0.1, 0, 0.9, 0.95])
plots.finalize_plot(
    fig, 
    show=True,
    save=True,
    filename=f"GOBP_Term_Enrichment_Scatter_By_Category_dataDriven",
    filepath=figure_path,
    formats=figure_formats, 
    transparent=transparent_bg,
    dpi=figure_dpi,
)

The dumbbell plot is not the cleanest and uses considerable space, but it is useful when the main point is to show which status terms are shared versus unique between methods at each timepoint.
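The shared/unique breakdown behind the dumbbell plot can also be computed directly with set operations per timepoint. This is a minimal sketch on a toy frame with the same columns (`Duration`, `Method`, `Term`); the term names are placeholders, not real enrichment output:

```python
import pandas as pd

# Toy results table with the columns the dumbbell plot relies on
toy = pd.DataFrame({
    "Duration": ["48hr", "48hr", "48hr", "72hr", "72hr"],
    "Method":   ["Original", "dPFs", "dPFs", "Original", "dPFs"],
    "Term":     ["A", "A", "B", "C", "D"],
})

summary = {}
for duration, grp in toy.groupby("Duration"):
    orig = set(grp.loc[grp["Method"] == "Original", "Term"])
    dpfs = set(grp.loc[grp["Method"] == "dPFs", "Term"])
    summary[duration] = {
        "shared": sorted(orig & dpfs),        # terms found by both methods
        "Original only": sorted(orig - dpfs), # missed by dPFs
        "dPFs only": sorted(dpfs - orig),     # gained by dPFs
    }

print(summary)
```

Applied to the real table, this yields the exact term lists the figure visualizes, which can be easier to cite in text than the plot itself.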


Summary

This analysis compared four approaches to protein-level quantification and their impact on downstream analysis:

Key Findings:

  1. Reproducibility (CV Analysis):

    • All methods show acceptable CVs across conditions
    • dPF-based quantification provides proteoform-level resolution without sacrificing reproducibility
  2. Sample Separation (PCA):

    • All methods successfully separate hypoxia from normoxia conditions
    • dPF approach captures additional biological variation at the proteoform level
  3. Statistical Testing (QuEStVar):

    • Different quantification methods yield different numbers of significant proteins
    • dPF analysis identifies cases where proteoforms within the same protein show opposing regulation
  4. Enrichment Analysis:

    • Pathway enrichment results vary by quantification method
    • dPF-based analysis reveals enriched terms that are missed by traditional protein-level approaches
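For reference, the reproducibility metric in point 1 is the per-protein coefficient of variation across replicates, CV (%) = 100 · sd / mean. A minimal sketch on a toy intensity matrix (real data comes from the method-specific quant matrices earlier in the notebook; pandas uses the sample standard deviation, ddof=1):

```python
import pandas as pd

# Toy replicate intensities (rows = proteins, cols = replicates)
mat = pd.DataFrame(
    {"rep1": [10.0, 100.0], "rep2": [11.0, 90.0], "rep3": [9.0, 110.0]},
    index=["P1", "P2"],
)

# Per-protein CV (%) across replicates
cv = mat.std(axis=1) / mat.mean(axis=1) * 100
print(cv.round(2))  # both toy proteins come out at 10.0%
```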

Conclusions:

The choice of protein quantification method has substantial impact on biological interpretation. ProteoForge's dPF-based approach provides additional resolution that can reveal regulatory complexity hidden in traditional protein-level analyses.

Output Files:

  • QuEStVar results: *_QuEStVar.feather files for each method/timepoint
  • Enrichment results: gProfiler_Enrichment_Analysis_AllSets.csv
  • Figures: Saved to figures/hypoxia/04-Downstream/
print("Notebook Execution Time:", utils.prettyTimer(utils.getTime() - startTime))
Notebook Execution Time: 00h:01m:15s